At AgileDD, one of the squirrels could be considered a Stakhanovite hackathonist! It's Amit, our Kaggle champion, who is a star in the time series championships. But we should admit that the other squirrels, including myself, don't have this experience. So, when AgileScientific and TOTAL proposed to organize a hackathon in Paris dedicated to geoscience and machine learning, during the weekend just before EAGE (June 10th and 11th, 2017), I jumped at it. And my first conclusion is that I want to do more of them! This post will try to explain the many reasons why I found last weekend so enjoyable.
No doubt the first reason comes from the team I met and worked with during these 2 intense days. Before the event, I didn't spend time on the hackathon's dedicated Slack (Slack is a collaborative social network used by hackers to work and share their findings ... one more way to receive even more notifications, but Matt assured me that the information / noise ratio is usually good, so ...) to create a team, prepare data and be ready for day 1. Instead, I just came with a few ideas of what could be done and a handful of unstructured oil-well-related documents. By chance, over a coffee during the welcome session, I met a team of 3 master's students from the University of Leeds interested in applying ML to data management.
Their names and LinkedIn pages are just below.
Adam Goddard, Jack Woollam, Daniel Stanton
At 10:00 on Saturday morning, our project was cast: "Indexing TIFF log images according to their graphic content", and a first objective was set: "Sorting the log images according to whether they have a lithological column or not."
On paper, it should not be too complicated. The open source world contains a lot of CNN (Convolutional Neural Network) implementations to distinguish a dog pic from a cat pic, the face of your girlfriend from the face of her mother... Could that be reused to distinguish the plot of a raw wireline log from a composite plot with a litho column, both saved as TIFF images?
It could appear to be just something fun to do, but I think it may have a lot of consequences. If it works, it means that the millions of legacy technical documents stored in any organisation can be indexed according to their graphic content, and queries such as "I want all the documents with a core image" or "list me the documents with a synthetic seismogram" become possible without manual indexing. We can also imagine going into more detail, indexing the documents according to each drawn lithology and requesting the documents that have a dolomite or an evaporite interval. As you can see, it opens up a lot of perspectives ... and for the time being, we got a name and a logo for our project: Logs on the Rocks!
The second nice thing about this hackathon was that you didn't work alone with your team. Some helpers were available, and the first one who came to help us was Victor Zaytsev from TOTAL. Victor is a talented mathematician and statistician and he gave us his view on how our problem could be solved. He confirmed it was achievable over the two days, and according to him, our problem looked something like this:
OK, but even though the team had followed the Python bootcamp on Friday, we didn't feel comfortable starting a notebook with some Python and TensorFlow inside, even if we all agreed it should be done this way to really control what we do. Since our purpose was just to make a PoC, could we find another solution, if possible one that was not source code intensive?
Here came our second great advisor, Francois Courteille, a senior solution architect at NVIDIA.
Francois explained to us the capabilities of DIGITS, one of the NVIDIA "qwiklabs" in the cloud, which could help us.
This lab is really cool. It offers a user-friendly interface to use several deep learning algorithms and to test the associated parameters. Everything we needed! In addition, you may run 5 sessions of 2 hours per day free of charge. Knowing we were four ... It was even more than what we needed!
In addition to the team and the support, the third good reason which made the hackathon enjoyable was the "creative thinking spirit". For 2 days, you are away from your emails, there are no meetings and, on top of that, you are in an environment specifically designed by TOTAL to encourage creativity. They call this space the "Booster" and they use it to help their teams start their projects outside the silos where they usually live. The kitchen is large, there is plenty of coffee and croissants, there are large couches to talk with people, and a lot of open spaces and private spaces with large screens to work in small groups... really nice.
Is it due to this creative atmosphere or only to the well-known genius of the University of Leeds? I don't know, but the team was very quick to come up with an incredibly good idea to solve an issue all teams face in any machine learning project: how to generate a sufficient amount of training data in a minimum of time. The NVIDIA DIGITS lab required tagged square pics as training data ... and a log image, with rocks or without, is anything but square! It is around 21 cm wide and may be several meters long! Nevertheless, they found a way to kill two birds with one stone and multiply the training data while complying with the DIGITS constraint:
1 - Tag the logs (it was my job; tagging 120 log images as "with rock" or "without rock" ... was not too difficult)
2 - Shrink all the log images to a width of 500 pixels
3 - Cut each log image into intervals of 500 pixels
4 - Remove the first 2 intervals, which usually contain the log header, something we are not interested in.
In less time than it takes me to explain the process, they scripted it and produced around 2000 square, tagged pics, half of them with rocks.
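The team's actual script is not reproduced in this post, but steps 2 to 4 above can be sketched in a few lines of Python. This is only a minimal illustration under my own assumptions: the function names tile_boxes and cut_log are hypothetical, Pillow is assumed as the imaging library, an incomplete tile at the bottom of the log is simply dropped, and the tiles are written to one folder per tag.

```python
from pathlib import Path

TILE = 500  # tile size in pixels, from steps 2 and 3
SKIP = 2    # tiles dropped at the top (log header), from step 4


def tile_boxes(width, height, tile=TILE, skip=SKIP):
    """Crop boxes (left, upper, right, lower) for a log resized to `tile` px wide."""
    scaled_height = round(height * tile / width)
    n_tiles = scaled_height // tile  # assumption: a trailing partial tile is dropped
    return [(0, i * tile, tile, (i + 1) * tile) for i in range(skip, n_tiles)]


def cut_log(path, label, out_dir, tile=TILE, skip=SKIP):
    """Resize one tagged log image and save its square tiles under out_dir/label/."""
    from PIL import Image  # Pillow, imported here so tile_boxes works without it

    path = Path(path)
    target = Path(out_dir) / label  # assumption: one folder per tag
    target.mkdir(parents=True, exist_ok=True)
    with Image.open(path) as img:
        boxes = tile_boxes(img.width, img.height, tile, skip)
        scaled_height = round(img.height * tile / img.width)
        # Convert to RGB so any TIFF colour mode can be saved as PNG.
        scaled = img.convert("RGB").resize((tile, scaled_height))
    for n, box in enumerate(boxes):
        scaled.crop(box).save(target / f"{path.stem}_{n:03d}.png")
    return len(boxes)
```

A call such as cut_log("well_42.tif", "with_rock", "tiles") (the file name is made up) would write the tiles of one tagged log into tiles/with_rock/. Every tile inherits the tag of the whole log, which is how 120 tagged logs can turn into a couple of thousand tagged pics.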
Will the training set be sufficient? Should we convert all the pics to greyscale? Is 500x500 pixels an appropriate size? Which of the algos offered by DIGITS should we use? Plenty of questions to solve and parameters to test once the data had been moved into the DIGITS lab. But how to do that?
Moving files from a laptop to the cloud in a way that makes them accessible to DIGITS was not trivial for us, but the solution came from another hackathon sponsor: Amazon AWS. S3 bucket storage offered a way to store our files and provide URLs with "understandable" paths, not long strings of coded numbers you cannot use in scripts! What about the S3 cost? It didn't matter, since Amazon provided the participants with a $100.00 credit for the weekend. More than sufficient again!
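To give an idea of what "understandable" paths can look like, here is a minimal sketch, not the commands we actually ran. The bucket name, the "logs-on-the-rocks" prefix and the helper names s3_key, s3_url and upload_tiles are all hypothetical, and the sketch assumes boto3 with AWS credentials already configured and tiles stored as PNG files in one folder per tag. How DIGITS is then given access to the files (for example through public read permissions) is not covered here.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import quote


def s3_key(label, filename, prefix="logs-on-the-rocks"):
    """Readable object key: <prefix>/<tag>/<tile file name> (prefix is hypothetical)."""
    return str(PurePosixPath(prefix) / label / filename)


def s3_url(bucket, key):
    """Virtual-hosted-style URL of an object; the bucket name is an assumption."""
    return f"https://{bucket}.s3.amazonaws.com/{quote(key)}"


def upload_tiles(tile_dir, bucket):
    """Upload every <tile_dir>/<tag>/<name>.png and return the resulting URLs."""
    import boto3  # imported here so the two helpers above work without it

    s3 = boto3.client("s3")
    urls = []
    for path in sorted(Path(tile_dir).glob("*/*.png")):
        key = s3_key(path.parent.name, path.name)
        s3.upload_file(str(path), bucket, key)
        urls.append(s3_url(bucket, key))
    return urls
```

With this naming, a tile ends up at a URL that reads like a folder path, tag included, which is easy to generate and parse in a script.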
Once everything was uploaded and usable in DIGITS, it was ... Sunday 1:00 PM and the prize ceremony was planned to start at 3:00!
Rapidly, with the support of Francois, we found the algo which recognized our log pics the best. FANTASTIC, the objective was reached. YES, WE CAN automatically recognize whether a log image contains some lithology or not!
This graphic illustrates that our model recognizes whether a log contains a litho column with a 70% success rate.
Obviously, much more work has to be done to analyse this result and improve the accuracy. We already know several ways to improve it:
- Working on the training set: more images; testing the effect of selecting only the litho column portion of the log rather than the full interval as we did; creating 2 classes as we did, or 4, such as colored image with litho, colored image without litho, B&W image with litho and B&W image without litho ...
- Measuring the effect of the training pics size. We noticed that the accuracy drops if we reduce the pic size from 500 to 250. Is the reverse true?
- Testing some other algos. Francois told us that the DIGITS algo library will soon be enriched with some new ones coming from TensorFlow.
If somebody is interested in this PoC and would like to apply it to their own logs or images, we will be pleased to go deeper.
But time was running fast and it was now time for each of the 13 teams to present their work.
I was impressed by all of them, but for me two were ahead. The "GANster" team showed that it is possible to go from a velocity model to a seismic section and back, not with modeling but with some deep learning. Could statistics do as well as physics, but faster and cheaper, for modeling and inverse modeling? The images below seem to say yes:
- From model to image:
- From image to model:
Having worked on seismic interpretation software in the past, I also found the work done by the team "It's not my fault" extremely powerful. They showed that it is possible to use ML to detect faults or "traps" on synthetic seismic images. Since the "trap" is an interpreted notion, it may be difficult to do this on real data, but I think their ML approach could really help to detect objects such as faults or terminations (onlaps, downlaps ...)
Please also notice in the image below that they presented their results not with a Prezi or a PPT but with a web app they built during the 2 days! Really a great job.
But, among all the teams, we were the only one working on deeply unstructured objects such as log scan images. It was time to present our results. Our presentation starts at 2h19m17sec in the video below; we are muted at the beginning, but it gets better a little bit later. If you missed it when it was broadcast live on YouTube, you can get it here:
The fact that we worked on unstructured data and obtained some results without producing a line of code was appreciated by the jury (TOTAL, SHELL and DELL EMC representatives + the participants using Mentimeter, another cool app for interactive presentations). As a result, we received the "originality prize" for our efforts and plenty of gifts from the sponsors!
Thank you to the jury and the sponsors, and a big merci to the whole AgileScientific team!
And be sure I will try to come back ... maybe with some other squirrels if possible!
For more info, more images, video, stats, Slack, voting results, notebooks, Python code ... just go to AgileScientific: agilescientific.com/blog/2017/6/13/le-grand-hack