Well-logging plays an important role not only in petroleum exploration and exploitation but also in groundwater, mining, environmental, geotechnical, and research applications. These logs measure the porosity, resistivity, density, conductivity, saturation, and other properties of subsurface rocks.
A few logs, such as gamma-ray and Poisson's ratio, are excellent for identifying lithology; others, such as density and P-wave velocity logs, help characterize the rock type, pore fluids, and the pressure and compaction trends in the deposited sediments. Well-logging is the cheapest way to obtain subsurface data, as expensive coring is not required.
But geologists and engineers struggle when desired logs are incomplete, missing, or not uniformly collected. This happens for a number of reasons: recording errors, equipment failure, loss of contact with the borehole, cave-ins, or even the financial inability to run certain logs. As a geologist working in Alberta oil and gas, I too struggled with incomplete, incorrect, and missing wireline logs when interpreting or modelling the subsurface geology.
What to do? Re-logging is expensive. What is the solution?
Techniques such as machine learning can be used to draw patterns from known data, learn from them, and reliably predict unknowns. In this case, data from a clean log set is used to train an ML algorithm: it learns patterns from historical wireline log measurements and then reliably predicts the missing logs.
The advantages are huge. ML identifies patterns invisible to the human eye, its accuracy improves as more data is added, and valuable information can be extracted in just a few clicks.
This blog is about how I went about and solved my problem and I hope it also helps you in your work.
Identifying the problem is the first step in the development of an AI application. The next step is to acquire data and process it to make it ready for modeling. Then comes building an ML model. After modeling, the model is validated. Finally, the model is deployed to production.
The following steps illustrate how the user ingests, cleans, and transforms data to train an ML algorithm and then deploys the model to production.
Fig 1: Braintoy mlOS user interface
Step 1 – Acquire Data
I took the LAS files of 120 wells (fig. 2) from south of Fort McMurray, Alberta, from a reservoir that is geologically consistent in nature. The reservoir represents the McMurray Formation.
The LAS files of 86 wells were sorted, and the 70 wells that contained all the desired logs were taken to build the database. For simplicity, only a few logs were used, namely Depth, Bit, Caliper, CNCL (neutron porosity), DTC (compressional sonic travel time), DTS (shear sonic travel time), GR (gamma-ray), PE (photoelectric effect), SP (spontaneous potential), and ZDEN (density porosity). Microsoft Excel Power Query was used to consolidate the LAS files into one table. The database contains close to a quarter-million rows.
Fig 2: Location of the wells and the process adopted to build the database for model building
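The same consolidation can also be scripted instead of using Power Query. Here is a minimal pandas sketch with synthetic stand-in tables; in practice each per-well DataFrame would come from a parsed LAS file (e.g. via the `lasio` library's `read(...).df()`), and the well IDs and column names here are made up for illustration:

```python
import pandas as pd

# Stand-ins for per-well log tables parsed from LAS files
# (in practice: lasio.read("well_01.las").df().reset_index())
well_01 = pd.DataFrame({"DEPT": [100.0, 100.5], "GR": [45.2, 47.8], "PE": [3.1, 3.0]})
well_02 = pd.DataFrame({"DEPT": [200.0, 200.5], "GR": [60.1, 58.4], "PE": [2.8, 2.9]})

frames = []
for well_id, df in [("01-01", well_01), ("02-01", well_02)]:
    df = df.copy()
    df["WELL_ID"] = well_id  # tag each row with its source well
    frames.append(df)

# One flat table, ready to save as the CSV that gets uploaded
database = pd.concat(frames, ignore_index=True)
# database.to_csv("well_las.csv", index=False)
```

With ~70 wells this loop simply grows to iterate over all parsed files, producing the quarter-million-row table described above.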
Now the database was ready. Let me walk you through how I reconstructed synthetic well logs from the data we already had, producing a new log where the original was unknown, missing, incomplete, or doubtful.
Step 2 – Data Engineering
Data engineering starts with loading the data, then defining a dataset, and then generating a cross-validation dataset from it. The following steps walk through each in turn.
Loading the data (fig. 3) was easy and fast. The Data Engine allows uploading data in .csv format and can fetch data from online data sources such as GCS, S3 Bucket, MongoDB, PostgreSQL, etc. I had made a .csv file by the name well_las and uploaded it.
Fig 3: Loading data in CSV format or from GCS, S3 Bucket, and many more connectors
Once the data is loaded, I viewed the data.
To view data, click on the view icon as shown in fig. 4.
Fig 4: Viewing the data that I had loaded
Now I wanted to check out how I could wrangle the data.
Clicking on the desired dataset under ‘Manage Data’ opens data wrangling (fig. 5).
Fig. 5: Data Wrangling steps 1. Opening the view of Data Wrangling. 2. Data at a glance. 3. Analyze Features. 4. Data Wrangling.
The ‘Data Wrangling’ feature allows examining ‘Data at a Glance’, which shows the data, its range, graph-based visualizations, and simple statistics such as total rows, missing rows, data type, min, and max. ‘Analyze Features’ lets the user compare features graphically or spot outliers. In ‘Data Wrangling’ the user can modify values by deleting rows, splitting columns, adding new calculated columns, and more. These operations are often useful to “wrangle” the raw uploaded data into a dataset more suitable for ML model building.
After successful completion of Data Wrangling, we can move to the second step of the Engine i.e. ‘Define Dataset’.
The Define Dataset step comprises four sub-steps: Training & Target Features, Missing Value Handler, Feature Pre-processor, and Review & Save.
Training & Target Features:
In this step, data from the clean log set is used as the input features. The algorithm will be trained to understand patterns between the wireline log measurements (input features), with the targeted log as the output feature (fig. 6).
In this use case, I selected PE (the PhotoElectric well log) as the ‘Target (Output)’ with the rest of the logs selected as ‘Feature (Input)’. What it means is that I will use all other available logs in my database to predict PE.
I see the data type, missing values and statistics for the features. It is important to recognize the data type and check for missing values. These will be handled in the next step.
Fig. 6: Selecting input and output features for model building and then seeing data type, missing values, and quick statistics
Missing Value Handler:
As machine learning algorithms cannot handle missing values in the dataset, we need to remove them or replace them with something. In our case, the features have between 924 and 7,342 missing values each. Considering that the raw data has 222,949 rows, it was acceptable to simply remove the rows that contain missing values and prepare a clean dataset.
Fig. 7: Missing Value Handler (Removing missing value rows from the database)
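What the Missing Value Handler does here can be sketched in a couple of lines of pandas; the toy values below are made up and only illustrate the row-wise removal:

```python
import numpy as np
import pandas as pd

# Toy frame with gaps, standing in for the ~223k-row log database
logs = pd.DataFrame({
    "GR":  [45.2, 47.8, np.nan, 60.1],
    "DTC": [88.0, np.nan, 91.5, 85.3],
    "PE":  [3.1, 3.0, 2.8, np.nan],
})

before = len(logs)
clean = logs.dropna()  # drop any row with at least one missing value
print(f"kept {len(clean)} of {before} rows")
```

Dropping rows is reasonable here because the missing values are a small fraction of the data; with a higher missing rate, imputation would be worth considering instead.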
Feature Pre-processor:
This step allows applying pre-built transformations such as zero-to-one normalization, categorical-to-numeric encoding, standard scaling, and min-max normalization to pre-process the data for ML modeling.
As shown in fig. 8, I selected standard scaling for demonstration purposes only. I did not actually apply it, as the value ranges did not vary much and it was unnecessary; it would have made no difference to the result anyway.
Fig. 8: The ‘Feature Pre-processing’ step of ‘Define Dataset’
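For reference, standard scaling shifts each feature to zero mean and unit variance (this is what scikit-learn's `StandardScaler` computes); a quick numpy sketch with toy gamma-ray values:

```python
import numpy as np

gr = np.array([20.0, 55.0, 90.0, 140.0])  # toy gamma-ray readings (API units)

# Standard scaling: subtract the mean, divide by the standard deviation
scaled = (gr - gr.mean()) / gr.std()

print(scaled.mean(), scaled.std())  # ~0.0 and 1.0 by construction
```

Tree-based regressors (like the ones used later in this walkthrough) split on thresholds rather than distances, which is one reason scaling makes no difference for them.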
Review and Save:
This step names the dataset. I named it ‘well_tar_pe’ (fig. 9).
Fig. 9: Naming the dataset by clicking on Define Dataset button
I will now split the dataset into two parts: a training dataset and a validation dataset. The system can split in any percentage, the standard norm being 80%/20%, which I used here. There are about a quarter-million rows; the system randomly selects 80% of them to form the training set and the remaining 20% as the validation set. The training set is used to train the ML model, and the system then automatically uses the validation set to test the model and report its performance metrics (shown later in Step 3 – Machine Learning).
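The same 80/20 random split can be sketched with scikit-learn (random stand-in data here, not the real log table):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 5))  # stand-in for rows of the log database

# 80% training, 20% validation, selected at random
train, valid = train_test_split(data, test_size=0.2, random_state=42)
print(train.shape, valid.shape)  # (800, 5) (200, 5)
```

Fixing `random_state` makes the split reproducible, which is useful when comparing models trained at different times.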
At ‘Explore Dataset’ under ‘Cross-Validation Dataset’, I could re-examine the training and validation datasets created in the previous step. This is useful to check whether the training and validation datasets are bias-free (fig. 10).
Fig. 10: Generating the Cross-Validation files for the dataset
Fig. 11: Exploring the generated dataset which is now ready for ML modeling purpose
That completes the Data Engineering part. I am now ready to move to Machine Learning!
Step 3 – Machine Learning
The Machine Learning Engine is the module that employs the dataset built in the previous section to train models using various ML algorithms.
As our target is a regression problem (predicting the numerical value of PE), we choose the ‘Regression’ tab in the ‘Machine Learning Engine’ module (fig. 12). Then click ‘Add Base Model’ (1) to select the desired dataset (built in the previous section). This opens the ‘Select Regressor’ window (2).
I expected our dataset to work quite well with a DecisionTreeRegressor, so I chose it. The system had already suggested suitable parameters, and I clicked ‘Select Regressor’ (2).
Fig. 12: Choosing an appropriate algorithm for building a model
Clicking ‘Create New Model Version’ creates a DecisionTreeRegressor ML model in a couple of seconds (3) (fig. 13).
Fig. 13: Machine Learning Model building progress.
A generic model name and version are assigned (fig. 14). Selecting the model shows its rank, error, auto-generated documentation, and a ready-to-publish button. The right window of fig. 14 shows the performance metrics: the created model has a Mean Absolute Error of 0.133, a very satisfactory result.
Fig. 14: The created model with performance metrics shown
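Behind the button, the essential recipe is: fit a decision-tree regressor on the training split and score it on the held-out split with Mean Absolute Error. A scikit-learn sketch on synthetic stand-in data (the feature names and the target relationship are made up for illustration):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))  # stand-ins for GR, DTC, ZDEN, CNCL, ...
y = 3.0 + 0.5 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(scale=0.1, size=2000)  # stand-in PE

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=42)
model = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)

# MAE on the validation split -- the metric reported by the platform
mae = mean_absolute_error(y_va, model.predict(X_va))
print(f"MAE: {mae:.3f}")
```

MAE is in the same units as PE itself, which makes it easy to judge whether an error of 0.133 is acceptable for the log being synthesized.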
I also wanted to use a few more algorithms of choice. The system has many. It also allows you to add your own. It also has an ‘Autopilot’ option (fig 15) that builds models using popular algorithms and ranks them by their performance.
Fig 15: ‘Autopilot’ makes ML models using popular algorithms and ranks them by their performance scores.
Notice that the top four algorithms all rank 1, as the Mean Absolute Error is 1.3 for each.
I chose RandomForestRegressor (version v.8-v.a6b) (fig 17) for my purpose.
If we want, we can rename the tag for better recognition; I chose to keep it the same.
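Conceptually, ‘Autopilot’ does what the following sketch does: train several candidate regressors on the same split and rank them by validation MAE. The candidate list and data here are illustrative stand-ins, not what the platform actually runs:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 4))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=1000)  # nonlinear toy target

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

candidates = {
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=0),
}
# Fit each candidate and record its validation MAE
scores = {name: mean_absolute_error(y_va, m.fit(X_tr, y_tr).predict(X_va))
          for name, m in candidates.items()}
for rank, (name, mae) in enumerate(sorted(scores.items(), key=lambda kv: kv[1]), 1):
    print(rank, name, round(mae, 3))
```

On this deliberately nonlinear toy target, the ensemble model comes out ahead of the linear one, mirroring why it pays to let an autopilot try several algorithm families.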
One of the features I find most useful is the automated documentation. It records all the steps used to build the ML model, which is important for understanding the model's details and for model governance. The automated documentation includes the model information, performance, the raw and training datasets and their analytics, feature importance, and a comparison of predicted vs. original values.
Fig 16: Features of auto-documentation
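The feature-importance section of that documentation corresponds to what tree ensembles expose directly. A sketch with toy data, where the feature names are placeholders and the target is constructed so one feature clearly dominates:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
# Target depends strongly on feature 0, weakly on feature 2, not at all on 1
y = 2.0 * X[:, 0] + 0.1 * X[:, 2] + rng.normal(scale=0.05, size=500)

model = RandomForestRegressor(n_estimators=30, random_state=0).fit(X, y)
for name, imp in zip(["GR", "DTC", "ZDEN"], model.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

For a log-prediction model, this ranking tells the geologist which input logs are actually driving the synthetic PE, a useful sanity check against physical intuition.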
Now we are ready to choose and publish our best model. I chose the ML model that used the RandomForestRegressor. Hitting the publish button brings up a pop-up window for additional comments and confirmation (fig 17). The pop-up also provides an opportunity to select a reviewer for the model governance (described in the next module).
Fig 17: The selected ML model is being published for review
That brings us to the end of the Machine Learning part. The next step is Model Governance.
Step 4 – Model Governance
Good governance practice means that AI needs to be validated before production use.
The system sends an email to the selected reviewer, who might be another subject matter expert or the project sponsor. The reviewer receives the full documentation to review the model's parameters and validate that it indeed gives the desired performance and results.
I am not going too deep into this as it deserves an article of its own!
Fig. 18: The reviewer can Accept or Reject the model
Once the reviewer accepts or rejects a model, the system sends an automatic email to the modeler (fig 18).
This brings me to the next stage i.e. Deploy Model.
Step 5 – Deployment
It is now time to deploy the accepted model to production. Since a regression model was created, I selected the Regression tab under ‘Deploy Models’. The accepted models are available under ‘Select Model & Deploy’; our accepted model is v.8 of ‘well_tar_pe’.
To deploy the model, we need to select the model first and then successively click ‘Use this model’, ‘Generate Code’ and ‘Deploy’ (fig 19). These buttons trigger a reconfirmation pop-up window.
Fig 19: Deploying the model to production in clicks
Clicking the Deploy button starts a ‘Docker’ process that bundles all the necessary code into a container and creates an API that can be called from any application, anywhere in the world (fig 20). Sharing API keys with others is a simple process in mlOS and is not covered in this blog.
Fig. 20: Deploying the model generates a Docker in the cloud
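Calling such a deployed model from code typically means POSTing a JSON row of log measurements. The endpoint URL, authentication scheme, and field names below are entirely hypothetical; the real values come from the mlOS deployment page, not from this sketch:

```python
import json
import urllib.request

# Hypothetical endpoint -- the real URL and auth come from the deployment page
API_URL = "https://example.com/api/models/well_tar_pe/predict"

def score_row(features: dict, api_key: str) -> dict:
    """POST one row of log measurements to the deployed model (not run here)."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(features).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:  # network call, shown for shape only
        return json.load(resp)

# One row of inputs, keyed by the same log names the model was trained on
payload = {"GR": 52.3, "DTC": 88.1, "ZDEN": 2.45, "CNCL": 0.21}
print(json.dumps(payload))
```

The same call works from any language that can make HTTP requests, which is the point of containerized API deployment.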
The model is now live. It’s time to interact with it in the next step – the dashboard.
Step 6 – Dashboard
A real-time utility dashboard is automatically generated for every deployed model. Opening the ‘Dashboard’ module shows a utility dashboard for all deployed models. I had deployed the ‘well_tar_pe’ ML model (fig 21), so clicking on the ‘well_tar_pe’ icon takes me to its dashboard (fig 22).
Fig. 21 Dashboard window from where the user can interact with the deployed model
Fig 22: Automatically generated dashboard of the deployed model ‘well_tar_pe’
The main menu of the live dashboard allows the following:
- Realtime Prediction: Check the predictions from the deployed model
- Data Scoring: Import external files and the model will predict the target feature
- Data Analysis: The scored data can be graphically visualized
- Variable Importance: Lists input variables and their relative importance to the target
- About Model: Summary of the deployed model
Fig. 23: Uploading new data for scoring using the deployed model
Fig 24: External data is ready for scoring after the headers are validated by the system
Fig. 25: Validated data produced by the deployed model
Fig. 26: Data is scored. The original PE and the predicted PE can be seen here.
Fig. 27: The scored data is graphically presented.
The purple log is the original data, which was purposely hidden during prediction. Comparing it with the predictions shows a good fit, demonstrating the validity of the deployed model.
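This kind of blind test, where the original log is withheld and then compared against the predictions, can be quantified with a couple of numpy lines. The values below are toy numbers, not the actual PE data:

```python
import numpy as np

# Toy blind test: original PE hidden during prediction, revealed afterwards
original  = np.array([3.10, 3.05, 2.88, 2.95, 3.20])
predicted = np.array([3.08, 3.11, 2.90, 2.99, 3.15])

mae = np.mean(np.abs(original - predicted))          # average depth-by-depth error
corr = np.corrcoef(original, predicted)[0, 1]        # how well the shapes track
print(f"MAE: {mae:.3f}  correlation: {corr:.3f}")
```

A low MAE together with a high correlation is what a "good fit" between the purple and predicted curves looks like numerically.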
I successfully built a model that can reliably generate synthetic data for the PE log using historical well-log data. My frustrations as a geologist had no solution back when machine learning was not as accessible, but now I have found one.
I hope that this also helps your work, as it helped mine.
Dr. Subrata Biswas is a data scientist and a geologist skilled in sedimentation and tectonics, mapping, scientific computing, quantitative modelling, and database and software engineering. With a Ph.D. from the University of Vienna and 20 years’ experience, he is well versed in ML/AI, software engineering, GIS technologies, remote sensing, image processing, and high-performance computing.