Indexing well-related data according to the CDA Taxonomy
In the UK, the CDA holds a collection of 450,000 well- and seismic-related documents reported by the UKCS operators since the early days of North Sea O&G exploration and production.
In August 2016, AgileDD received a copy of this large collection of unstructured documents as part of the CDA Unstructured Data Challenge. We noticed that the way the CDA members had indexed this large amount of unstructured documents was very efficient and accurate. We therefore trained iQC to use a similar taxonomy, to check whether the result of an automated indexing could compare with an indexing performed manually over time.
The CDA taxonomy indexes well-related documents according to the well-bore name, a type and a sub-type. The type is an indication of the document format: report, log image, digital file (DWL, WDD ...); the sub-type is a description of the content: wireline log, VSP, engineering ...
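A minimal sketch of such an index entry might look like the following. The field names here are illustrative assumptions, not the CDA's or iQC's actual schema:

```python
from dataclasses import dataclass

@dataclass
class WellDocument:
    """Illustrative index entry following the CDA taxonomy pattern.

    Field names are assumptions for the sketch, not the real CDA schema.
    """
    wellbore: str   # well-bore name, e.g. "132/6-1"
    doc_type: str   # document format: "report", "log image", "digital file"
    sub_type: str   # content description: "wireline log", "VSP", "engineering"

# One entry for a scanned wireline log image of well 132/6-1
doc = WellDocument(wellbore="132/6-1", doc_type="log image", sub_type="wireline log")
print(doc.wellbore)  # → 132/6-1
```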
Some of the sub-categories are very close to each other, such as wireline and MWD, and before running the first test we were concerned about iQC's ability to match a data expert in distinguishing between such close categories.
The observed result was excellent: iQC proved to be as good as a data expert at classifying the well-related documents in more than 80% of the cases. This ratio increased as we increased the training, and the few remaining discrepancies were frequently due to erroneous manual indexing.
In the display on the left, the automatic indexing of the documents related to the well 132/6-1 is 100% in line with the manual indexing. iQC has proved its ability to distinguish between scanned reports and log images, between wireline and MWD, and even between pre-drilling and post-drilling reports, regardless of document file format or size.
Extracting metadata from well-related documents
According to the Gartner research report, How Metadata Improves Business Opportunities and Threats, “Metadata unlocks the value of data and, therefore, requires management attention.”
And since we at AgileDD know how data-driven O&G companies are, we know how hot a topic metadata can be for our customers. Our ambition is therefore to support our industry in extracting, efficiently, easily and at minimal cost, the metadata that will improve your business and move you ahead of your competitors.
In a recent post, Ian McPherson identified four ways metadata delivers value to the Energy business:
Streamlined business workflows and increased productivity
Quality control of your information to ensure accuracy and compliance with industry regulations
Consistent file information, even when a collection features files from a variety of sources, such as an acquisition, or from multiple vendors
Accurate file retention and archiving.
And you may have experienced some others! But all these business benefits are accessible only on the condition that you can detect, extract and store the metadata you need, when you need it, for the business problem you have to solve.
Until recently, extracting metadata was a very labor-intensive and delicate task performed by skilled data experts, able to detect the parameters of the validated lithostrat column, the casing column, the wireline or MWD logs measured in the well, or the mud column characteristics among hundreds of documents related to the same well, sometimes containing different versions of the same reality. The effort needed to get the ball out of a rugby scrum is a good illustration of the effort you have to deploy to extract a set of metadata from adverse well-related documents! In addition, if your business objectives change, you are obliged to go back to your unstructured documents and extract a new set of metadata with the same effort.
By developing iQC, a machine learning technology to automate document cataloging (the art of detecting and extracting metadata), AgileDD has dramatically decreased the effort required to run this task efficiently.
Not only does iQC make it possible to automatically extract the metadata targeted by the users from unstructured documents; it can also evaluate the confidence level of each extraction. This is done by comparing each particular extraction with past experience of extracting similar metadata from documents.
As a result, the iQC user interface displays each metadata value together with its confidence factor, using a colored flag to encourage the data specialist to focus on the "red" metadata only when QCing the extraction results.
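The mapping from a confidence factor to a colored flag can be sketched as a simple thresholding rule. The thresholds below are assumptions for illustration, not iQC's actual values:

```python
def confidence_flag(confidence, low=0.5, high=0.8):
    """Map a confidence score in [0, 1] to a review flag.

    The 0.5 / 0.8 thresholds are illustrative assumptions, not iQC's.
    """
    if confidence < low:
        return "red"     # low confidence: needs the data specialist's attention
    if confidence < high:
        return "orange"  # borderline: worth a quick look
    return "green"       # high confidence: can usually be trusted

print(confidence_flag(0.3))   # → red
print(confidence_flag(0.95))  # → green
```

In a QC session, only the "red" extractions would then be queued for manual review, which is what keeps the workload low.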
Since iQC extracts metadata from all processed documents and estimates a confidence factor for each, it is possible to associate the best metadata candidate with a particular well among tens of candidates. But iQC manages all the metadata extracted from all the documents, not only the best candidate. This makes it possible to display the variability of the values for a particular metadata item related to a particular well. In the example displayed on the left, it is clear that 527.00 is the most probable value for the water depth (green rectangle), but other values for the water depth of this particular well also exist among all the unstructured documents associated with it.
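Selecting the best candidate while keeping the full spread of values can be sketched as follows. The record layout and scores are illustrative assumptions, not iQC's actual data model:

```python
# Each extraction keeps its value and confidence; nothing is discarded,
# so the variability across documents stays visible to the user.
extractions = [
    {"well": "132/6-1", "field": "water_depth", "value": "527.00", "confidence": 0.95},
    {"well": "132/6-1", "field": "water_depth", "value": "530.00", "confidence": 0.40},
    {"well": "132/6-1", "field": "water_depth", "value": "527",    "confidence": 0.70},
]

# Best candidate = highest-confidence extraction for this well and field
best = max(extractions, key=lambda e: e["confidence"])
print(best["value"])  # → 527.00

# All observed values remain available for inspection
all_values = [e["value"] for e in extractions]
```

Keeping every candidate, not just the winner, is what lets the interface show a user that a well's water depth has competing values across its documents.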
The iQC interface allows this variability to be investigated very rapidly and very intuitively. Clicking on a value immediately displays the metadata within its source document, and this can be done for any value of the metadata. In other terms, iQC allows you not only to detect the well metadata but also to "source" your information, making it more trustworthy.
QCing the results and improving the Learning Model at the same time
In the introduction to his 1997 book "Machine Learning", Tom Mitchell of Carnegie Mellon University wrote that machine learning applications improve with experience:
While a classical application contains all the knowledge needed to perform a particular task from day one, a machine learning application does not. Instead, it has the capacity to LEARN: the capacity to establish inferences from data and to reuse this knowledge afterwards, with the expectation of improving its skill at performing the task.
Guided by an SME (Subject Matter Expert), iQC can learn in several different ways. To make this teaching easy and efficient, an advanced GUI (Graphical User Interface) has been designed.
First, the user can define the taxonomy to be used for document classification and the metadata to be extracted, along with some of their properties. This is done in the admin interface.
With this basic information, it is possible to run a first batch and observe the results. Thanks to the confidence evaluated for each metadata detection, and the fact that each detected metadata item is located accurately within its source document, the SME can rapidly focus on the least probable detections and teach iQC how to improve its detection.
A first option is to validate a result. When this is done, iQC remembers the validated result and its surrounding context, and creates an inference between the two in the learning model. The next time a similar context is found in a document, the metadata will be detected with a higher probability.
A second possibility is to refute the result. In this case, a negative inference is established between the metadata and the surrounding context to prevent such a detection in the future.
It is also possible to delete a result without explaining the reason to the machine.
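The validate/refute feedback loop can be sketched as adjustments to a table of context-to-metadata weights. This is a toy model under our own assumptions, not iQC's actual learning model:

```python
from collections import defaultdict

# Toy weight table: (surrounding context, metadata field) → evidence score.
# A positive score makes future detections in similar contexts more likely,
# a negative score suppresses them. Purely illustrative, not iQC internals.
weights = defaultdict(float)

def validate(context, field):
    """Positive inference: this context is good evidence for this metadata."""
    weights[(context, field)] += 1.0

def refute(context, field):
    """Negative inference: avoid detecting this metadata in this context."""
    weights[(context, field)] -= 1.0

# The SME confirms one detection and rejects another
validate("water depth :", "water_depth")
refute("derrick floor elevation :", "water_depth")

print(weights[("water depth :", "water_depth")])             # → 1.0
print(weights[("derrick floor elevation :", "water_depth")]) # → -1.0
```

Deleting a result, by contrast, would simply leave the weight table untouched, which matches the description above of removing a result without teaching the machine anything.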
To make validation and refutation fast, a specific interface allows the user to move rapidly through all the source documents.
Finally, the SME can be more explicit while training and graphically select the location of the metadata to be detected. Performing such explicit training several times for a new metadata item to be automatically extracted is the most secure way to define a learning model for that particular metadata.