## Clustering considerations for machine learning

Thursday, July 2, 2020

**Most datasets in oil and gas are multi-dimensional having many variables that make it difficult for us to analyse and find meaningful patterns. Therefore, the reduction of dimensionality is a fundamental part of a machine learning workflow and cluster analysis is one of the key tools used for this. Different dimensionality reduction and data clustering techniques are available.**

Philip Lesslar, a data solutions consultant with Precision DM, formerly with Shell and PETRONAS, explained some of the techniques, speaking at the Digital Energy Journal forum in KL in October, 'How to Digitalise Exploration and Wells'.

The preparation of data for any kind of analytics or machine learning starts first with reducing the dimensionality of the data and that can be done using a number of different multivariate statistical techniques.

He started by describing a higher-level group of statistical techniques for dimensionality reduction:

a) Cluster analysis is a technique that aims to find 'natural' groups in multivariate data sets,

b) Principal Components Analysis looks at reducing dimensionality by finding a smaller set of variables that is still representative,

'Principal component analysis' can be used to reduce the dimensionality, identifying which variables make the biggest impact on others and which seem unrelated to others. So, for example you can reduce the number of dimensions you are working with from 100, which is very hard to make sense of, to 10.

You want to preserve the 'information' - the useful signal - while reducing the volume of data. An analogy is file compression, as when we compress files to zip format. Compression can be 'lossless', which preserves all the information (zip is of this kind), or 'lossy', which tries to preserve only the critical information.

'So, we can think of this as like something we use already,' he said.

c) Factor analysis is useful for datasets where a large number of observed variables are thought to reflect a smaller number of unobserved variables,

'Factor analysis', similar to 'principal component analysis', can be used where you believe a small number of variables - perhaps unmeasured directly - drive a large number of other variables.

d) Multi-dimensional scaling is a technique that helps visualise similarity of samples by transforming onto a 2D plane, and

e) Linear and multiple regression are techniques where one or more independent variables are used to predict the value of a dependent variable.
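To make the principal component idea in item b) more concrete, here is a minimal sketch in plain Python. It is not taken from the talk: the synthetic data, the function name `first_principal_component` and the use of power iteration are all assumptions chosen for illustration. Three measured columns are generated from one hidden driver, and the sketch recovers the single direction that carries most of the variation.

```python
import math
import random

def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the first principal component of rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeated multiplication converges to the eigenvector
    # of the covariance matrix with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project every centred row onto that direction: one number per row.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centred]
    return v, scores

random.seed(0)
# Hypothetical data: three columns that are noisy multiples of one hidden driver.
hidden = [random.gauss(0, 1) for _ in range(200)]
rows = [[h + random.gauss(0, 0.1),
         2 * h + random.gauss(0, 0.1),
         -h + random.gauss(0, 0.1)] for h in hidden]

direction, scores = first_principal_component(rows)
print([round(x, 2) for x in direction])
```

Here three columns collapse to one score per row with little loss of signal, which is the same idea as going from 100 dimensions to 10, only smaller.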

He said that the aim of the talk was to focus on cluster analysis and its significance in the machine learning workflow. He then described the general features of cluster analysis and some of the key types available:

Cluster analysis is a methodology for classification of objects with many data points. People were doing classifications long before we had computers. For instance, furniture is a class, and chairs and tables are subclasses. We recognise a chair when we see one even though it may look different from what we are used to. Our brains have learned the features that a chair possesses. Another example is Charles Darwin classifying organisms in the 1800s. Classification is part of how people learn about how something works, and part of how machines learn, Mr Lesslar said.

When working with data that have just two or three variables, it is easy to plot the data and visualise the groupings. Classification gets much harder when we are working with data that have many variables (multivariate). This type of data, such as seismic data or the images used to train a computer, is harder to visualise. We often have to look at data in many different ways to get insights from it. All data sets contain 'things you can easily see, and information you don't see until you transform some of the data,' he said.

'Data analysis is not for the faint hearted,' he said. 'If you are a dabbler, dabble yourself out of it.'

Most machine learning projects start with some form of cluster analysis - the first step is to create meaningful groups out of a collection of objects (classification), and the second step is to build a model about how the groups behave, based on extracting features out of each group. Then in future the model can be used to identify which group a new object belongs in.
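As a rough illustration of the two steps described above, the sketch below treats the mean of each group as a very simple 'model' and assigns a new object to the group with the nearest mean. The group names, the coordinates and the nearest-centroid rule are assumptions made for this example, not something described in the talk.

```python
import math

def fit_centroids(groups):
    """Step 2: summarise each group by the mean of its members."""
    return {
        name: tuple(sum(col) / len(members) for col in zip(*members))
        for name, members in groups.items()
    }

def predict(model, obj):
    """Assign a new object to the group whose centroid is closest."""
    return min(model, key=lambda name: math.dist(model[name], obj))

# Step 1 (assumed already done): objects sorted into groups. Hypothetical values.
groups = {
    "group_1": [(1.0, 2.0), (2.0, 1.0)],
    "group_2": [(8.0, 9.0), (9.0, 8.0)],
}

model = fit_centroids(groups)
print(predict(model, (2.0, 2.0)))
print(predict(model, (7.0, 9.0)))
```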

Definitions of machine learning workflows talk about the 'training data set' and the 'testing data set', which you use to build and test your models. It is important not to use the same data set for both.
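One simple way to keep the two data sets separate is to shuffle the records once and cut the list in two. The sketch below assumes a 25 per cent test fraction and a fixed random seed; both are arbitrary choices for illustration, and `train_test_split` is a hypothetical helper written here, not a library call.

```python
import random

def train_test_split(records, test_fraction=0.25, seed=42):
    """Shuffle a copy of the records and split it into two disjoint lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

samples = list(range(20))  # stand-in for 20 data records
train_set, test_set = train_test_split(samples)

print(len(train_set), len(test_set))
# No record appears in both sets.
print(set(train_set).isdisjoint(test_set))
```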

Machine learning gets much harder the more variables you have, because you don't know which variables are most important. And there are many variables in exploration and production, he said.

**Cluster analysis techniques**

One popular centroid-based clustering methodology for data sets in machine learning is K-means, which allocates every data point to the cluster whose mean is nearest to it, then recalculates the means and repeats.
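A minimal K-means sketch in plain Python follows. The sample points, the random choice of starting centroids and the fixed number of iterations are assumptions for illustration; production libraries use more careful initialisation and stopping rules.

```python
import math
import random

def k_means(points, k, iterations=20, seed=0):
    """Alternate between assigning points to the nearest mean and updating the means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen data points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members (keep it if it has none).
        centroids = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    labels = [min(range(k), key=lambda i: math.dist(p, centroids[i]))
              for p in points]
    return centroids, labels

# Two well-separated, hypothetical groups of points.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]

centroids, labels = k_means(points, k=2)
print(sorted(centroids))
print(labels)
```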

Mean shift clustering is an iterative method where the computer looks for the 'highest mean density' of point groups.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is similar to mean shift clustering but is better at spotting outliers.
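The outlier handling can be seen in a compact sketch of the DBSCAN procedure. The points and the parameter values `eps=1.5` and `min_pts=3` are assumptions picked so that the example has two dense groups and one isolated point; the label -1 for noise is a common convention adopted here.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster number, or NOISE if it sits in no dense region."""
    labels = [None] * len(points)  # None means "not yet visited"

    def neighbours(i):
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point of this cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(nbrs)
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]  # the last point is far from everything else

labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

Unlike K-means, the number of clusters is not specified in advance, and the isolated point is reported as noise rather than being forced into a group.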

'If we take the complexity out of it, relate it to what we already know, then it becomes less of a mystery,' he said.

Another common method is Expectation-Maximization (EM) Clustering using Gaussian Mixture Models (GMM). This method looks for clusters where data follows a bell curve, or simple normal distribution. Because each cluster has its own spread as well as its own mean, the method can detect elliptical clusters, not just circular clusters around a mean (centre).
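The EM loop can be sketched in one dimension, where each cluster is a bell curve with its own mean and variance. This is a simplification: with one variable there are no ellipses to see, and the synthetic data, starting guesses and iteration count are assumptions made for the example.

```python
import math
import random

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, start_means, iterations=50):
    """Fit a mixture of 1D Gaussians with the EM algorithm."""
    k = len(start_means)
    means = list(start_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: how strongly does each component "claim" each point?
        resp = []
        for x in data:
            dens = [weights[j] * normal_pdf(x, means[j], variances[j])
                    for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate each component from its weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # guard against a collapsing component
    return weights, means, variances

random.seed(1)
# Hypothetical data: two overlapping bell curves with different spreads.
data = ([random.gauss(0.0, 1.0) for _ in range(300)]
        + [random.gauss(6.0, 1.5) for _ in range(300)])

weights, means, variances = em_gmm_1d(data, [min(data), max(data)])
print([round(m, 2) for m in means])
```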

'Agglomerative hierarchical clustering' is progressive pairwise clustering. It starts by finding the two points with the strongest affinity, like someone entering a bar and talking to the person they have the strongest connection with. They then form a two-point cluster. A third person, a friend, joins them and now they form a three-point cluster. 'You are carrying out cluster analysis in a bar,' he said.
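The bar analogy maps onto a short loop: repeatedly merge the two closest clusters and record each merge. The sketch below assumes single linkage (the distance between two clusters is the distance between their closest members) and uses made-up points; other linkage rules would give different merge orders.

```python
import math

def agglomerative(points, n_clusters):
    """Merge the two closest clusters until only n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    merges = []

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        pairs = ((x, y) for x in range(len(clusters))
                 for y in range(x + 1, len(clusters)))
        a, b = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        merges.append((clusters[a], clusters[b], linkage(clusters[a], clusters[b])))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges

# Hypothetical points along a line: three on the left, two on the right.
points = [(0.0, 0.0), (0.5, 0.0), (2.0, 0.0), (10.0, 0.0), (10.4, 0.0)]

clusters, merges = agglomerative(points, n_clusters=2)
print(clusters)
for left, right, distance in merges:
    print(left, "+", right, "at distance", round(distance, 2))
```

The recorded merge distances are the information a dendrogram draws: each merge becomes a join in the tree at the height of its distance.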

The results of cluster analysis vary if you look at the data according to different dimensions. For example, individuals could be sorted according to their age, or home town. The 'hierarchical clustering' method finds the cluster dimension which makes the most sense.

In summary, there are two critical elements to your cluster analysis - one is the similarity measure used to calculate the 'closeness' of points in n-dimensional space, and the other is the clustering algorithm that calculates the progressive clusters.

**Proximity measures - Relations between data points**

He outlined a number of common proximity measures that are used to calculate the 'similarity' of data points based on various attributes that these points possess. These attributes can be either quantitative (measurable) or qualitative (nominal). 'Many of these measures have been developed years ago,' he said, 'an example being the Jaccard coefficient of similarity which was developed in 1908.'

Although these indices look mathematically complex, they can be more easily visualised and understood using Venn diagrams and set theory.
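The Jaccard coefficient is a good example of the set-theory view: it is the size of the overlap of two sets divided by the size of their union. In the sketch below the sample contents are hypothetical labels, and treating two empty sets as identical is an assumed convention.

```python
def jaccard(a, b):
    """Jaccard similarity: shared attributes divided by all attributes present."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(a | b)

# Hypothetical attribute lists for two samples.
sample_1 = {"A", "B", "C"}
sample_2 = {"B", "C", "D"}

# Overlap is {B, C} (2 items), union is {A, B, C, D} (4 items).
print(jaccard(sample_1, sample_2))  # 0.5
```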

A particular measure, the Euclidean Distance coefficient, looks complex in its n-dimensional form '..but if we reduce the dimension to 2, then we have Pythagoras's Theorem for right angled triangles which we have all studied in secondary school.'
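The link to Pythagoras can be checked directly. The sketch below writes out the n-dimensional formula and applies it first to a 3-4-5 right-angled triangle and then to two points with four attributes; the function name and the sample values are illustrative only.

```python
import math

def euclidean(p, q):
    """Square root of the summed squared differences across all dimensions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Two dimensions: sides of 3 and 4 give a hypotenuse of 5.
print(euclidean((0, 0), (3, 4)))  # 5.0

# The same formula with four attributes per point.
print(euclidean((1, 2, 3, 4), (2, 3, 4, 5)))  # 2.0
```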

**Some examples from exploration data**

In order to illustrate the use of cluster analysis on real data, Mr Lesslar showed examples using prospect appraisal data, well logs and micropaleontology (foraminiferal sample data).

Prospect volumetrics typically include attributes such as POS (probability of success), MSV (mean success volume), HSV (high success volume), REC (recoverable), STOIIP (stock tank oil initially in place) etc. By using these as input, one can expect that similar prospects will cluster together. He showed the resulting dendrograms, in which some clusters were clearly seen, and seen consistently even when different clustering algorithms were used. Without any prior knowledge about these affinities and clusters, results may sometimes prove surprising and trigger further ideas.

In the example with logs, Mr Lesslar simply took the digital point values of several log types in a well section and ran them through a few clustering algorithms. Similar to the previous example, a number of clusters could be clearly seen. The key point of the exercise was to show that patterns were there, but that further work would be needed to ascertain whether these patterns were significant or meaningful.

The last example made use of foraminiferal assemblage data in well samples to show that these data lend themselves well to cluster analysis. Clear groups could be seen. It is well known in micropaleontology that foraminiferal assemblages are environmentally sensitive, so such clusters can be used to identify groups of environmentally similar assemblages.

When doing cluster analysis work, you might try various different techniques on your data set and see what happens. You might see some clusters which look particularly interesting, and look at them more closely.

'We make no assumptions about data, let's just explore,' he said.

The clusters may not be immediately obvious, but reveal themselves with cluster analysis methods.

If you see certain patterns appear, you can bring in domain expertise to try to understand if there might be any sensible meaning behind them. Perhaps an expert might suggest looking for clusters around a certain depth, because it is in a different formation.

'Some parts we may never understand, they may be so complex. But at least if we know 80 - 90 per cent of moving parts, we have a better chance of getting meaningful results from this technology,' he said.

Oil explorers have used clustering techniques with micro fossil data to try to identify patterns in the groups of fossils found in different parts of a well, and how that relates to other factors, such as presence of hydrocarbons.

By tracking the patterns in the micro fossils, you can try to map the migration path of oil back to the source rock. You can also track different source rock types, source rock maturity, the temperature and pressure of the source rock, and the sedimentation rate. And ultimately you can use it to find new source rock and identify if it might have generated oil.

In conclusion, he stressed that machine learning is not a black box. One needs to understand the machine learning workflow's components, behaviours and limitations. It is also important to look at the data, then look at the results, and then look at the data again.
