
Targeting 'dark data'

Friday, June 30, 2017

PETRONAS is exploring automated tools for working through 'dark data'.

Usually, only about 10 per cent of data collected in oil and gas companies could be called 'light data', said Jagathesan Balakrishnan, manager of business insights with PETRONAS.

This means people can find it easily and trust it.

The other 90 per cent is made up of data sitting in repositories or reports.

He was speaking at the Digital Energy Journal conference in Kuala Lumpur in October, 'Connecting Subsurface, Drilling Expertise with Digital Technology.'

Lots of data gets put into repositories and eventually gets forgotten. This is sometimes called 'ROT' (Redundant, Obsolete or Trivial), he said.

Other data gets put into unstructured formats, such as reports, which people collect but don't really use, or don't even know is there.

The problem is that oil and gas people will typically search for what they want in the 'light data', whether or not they are likely to find it there.

To illustrate how silly this can be, Mr Balakrishnan told a story of a man who was searching for his keys under a street lamp. He had actually lost his keys in the park across the road, but was searching under the street lamp because there was more light there.

The amount of oil and gas data has been increasing a great deal in recent years, with new technology and sensors.

'From all of this data, only a fragment can be managed in a structured form. The rest ends up as dark data,' he said.

So how can the dark data be brought into the 'light'?

One useful step is to remove duplicate documents, so people searching know there is only one copy of each document to find.
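The article does not say which tools PETRONAS uses, but exact-duplicate removal is commonly done by hashing file contents: identical bytes produce identical hashes. A minimal sketch, using an assumed in-memory 'repository' of report files:

```python
import hashlib

def content_hash(data: bytes) -> str:
    """Hash raw file bytes; identical bytes give an identical hash."""
    return hashlib.sha256(data).hexdigest()

def find_duplicates(files: dict[str, bytes]) -> dict[str, list[str]]:
    """Group file names by content hash; any group with more than
    one name is a set of exact duplicates."""
    groups: dict[str, list[str]] = {}
    for name, data in files.items():
        groups.setdefault(content_hash(data), []).append(name)
    return {h: names for h, names in groups.items() if len(names) > 1}

# Hypothetical repository: two of the three files are byte-identical
repo = {
    "well_A_report.txt": b"Final well report, Well A, TD 3200 m",
    "well_A_report_copy.txt": b"Final well report, Well A, TD 3200 m",
    "well_B_report.txt": b"Final well report, Well B, TD 2100 m",
}
duplicate_groups = find_duplicates(repo)
```

This only catches byte-for-byte copies; near-duplicates need the similarity scoring discussed next.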

Identifying exact duplicate documents with automated tools is fairly easy. It is much harder when documents merely have similar content.

It can take about 10 minutes of someone's time to compare two files manually and decide which one is worth keeping, and these 10 minutes add up if you have thousands of duplicate files.

One approach is to use automated tools to assess how 'similar' a pair of documents is. A tool can show if the documents are exact duplicates, 'highly similar', 'moderately similar', or don't have much in common at all (in which case they are probably different documents).
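The similarity buckets described above can be sketched with a simple text-similarity score. This is an illustration only (the thresholds are arbitrary assumptions, not PETRONAS's), using Python's standard-library `difflib`:

```python
from difflib import SequenceMatcher

def similarity_label(text_a: str, text_b: str) -> tuple[float, str]:
    """Score two documents' text and bucket the score into the
    categories described above; thresholds are illustrative."""
    ratio = SequenceMatcher(None, text_a, text_b).ratio()
    if ratio == 1.0:
        label = "exact duplicate"
    elif ratio >= 0.9:
        label = "highly similar"
    elif ratio >= 0.6:
        label = "moderately similar"
    else:
        label = "probably different documents"
    return ratio, label

# Example: two near-identical header lines from hypothetical reports
score, label = similarity_label(
    "Well A, start depth 1000 m, end depth 2000 m",
    "Well A, start depth 1000 m, end depth 2100 m",
)
```

In practice a production tool would compare document structure and metadata as well as raw text, but the bucketing idea is the same.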

It can show you where the documents differ. A typical example for well logs is if you have two files with the same start depth and end depth, but part of the data for one of the curves has changed partway through the log.
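For the well-log example above, locating where two logs diverge can be sketched by walking both curves in parallel. This is a simplified illustration, assuming each log is a list of (depth, value) samples on the same depth grid:

```python
def first_divergence(log_a, log_b, tol=1e-6):
    """Walk two (depth, value) logs in parallel and return the first
    depth at which the samples differ by more than tol, or None."""
    for (d_a, v_a), (d_b, v_b) in zip(log_a, log_b):
        if abs(d_a - d_b) > tol or abs(v_a - v_b) > tol:
            return d_a
    return None

# Same start and end depths, but one curve value changes partway through
log_1 = [(1000.0, 85.2), (1000.5, 86.0), (1001.0, 87.1)]
log_2 = [(1000.0, 85.2), (1000.5, 86.0), (1001.0, 92.4)]
diverges_at = first_divergence(log_1, log_2)  # -> 1001.0
```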

The process can also help you get a better understanding of what data you have, he said. For example, it might tell you that a log file of a well has been split up into many smaller logs, each covering a certain depth range.


Another project is extracting data from pdf documents.

The standard way to analyse them is to use tools which extract the text from pdf documents, including optical character recognition (OCR) for scanned pages.

The weakness here is that the pdf format was designed for presenting data, not storing data. You might be able to extract the text, but you can't necessarily work with the data.

The computer systems need to be 'trained' in what to look for. For example, you might try to automatically extract a key parameter from well reports, such as the drilling rig floor elevation.

You might need to have a list of oil industry synonyms, so you can identify where the same thing might have been given a different name.
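A synonym-aware extraction step might look like the sketch below. The synonym list and report text are hypothetical, and the regular expression is a deliberately simple stand-in for the trained extraction PETRONAS describes:

```python
import re

# Hypothetical synonym list: different names used for the same parameter
DRILL_FLOOR_SYNONYMS = [
    "drill floor elevation",
    "rig floor elevation",
    "derrick floor elevation",
]

def extract_drill_floor_elevation(text: str):
    """Look for any known synonym followed by a number and optional unit."""
    for name in DRILL_FLOOR_SYNONYMS:
        match = re.search(
            rf"{re.escape(name)}\s*[:=]?\s*([\d.]+)\s*(m|ft)?",
            text,
            re.IGNORECASE,
        )
        if match:
            return float(match.group(1)), match.group(2)
    return None

# Hypothetical line extracted from a final well report
report_line = "Summary: Rig Floor Elevation: 25.3 m above MSL"
result = extract_drill_floor_elevation(report_line)  # -> (25.3, 'm')
```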

As an example, you might analyse 150 final well reports, identify the drill floor elevation in 149 of them, and then find out you have matched it correctly in 131 of those, he said.
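The figures quoted work out to a detection rate of about 99 per cent and a correctness rate of about 88 per cent:

```python
total_reports = 150
elevation_found = 149    # reports where a candidate value was identified
elevation_correct = 131  # of those, correctly matched

detection_rate = elevation_found / total_reports   # 149/150 ~ 99.3%
precision = elevation_correct / elevation_found    # 131/149 ~ 87.9%
```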

You can use 'fuzzy data search', which tries to spot patterns in the data taken from the pdfs - and then ask subject matter experts whether they make sense. You can gradually improve the searching by feeding the human analysis back to the computer.
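The fuzzy-search-plus-feedback loop can be sketched as follows. This is an assumed illustration (the vocabulary and helper names are invented), using Python's standard-library `difflib` for the fuzzy match:

```python
from difflib import get_close_matches

# Hypothetical vocabulary of known parameter names, grown over time
known_terms = ["drill floor elevation", "total depth", "spud date"]

def fuzzy_lookup(extracted: str, cutoff: float = 0.6):
    """Match a noisy extracted phrase against the known vocabulary."""
    hits = get_close_matches(extracted.lower(), known_terms, n=1, cutoff=cutoff)
    return hits[0] if hits else None

def expert_feedback(confirmed_term: str):
    """A subject matter expert confirms a new spelling or synonym;
    feed it back so future searches find it directly."""
    if confirmed_term not in known_terms:
        known_terms.append(confirmed_term)

# A close misspelling still matches the known term
hit = fuzzy_lookup("drill flor elevation")  # -> 'drill floor elevation'
```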

One idea for the future is to build a crowd-sourced tool which can support the verification of text mining by the subject matter expert - ie use the 'crowd' to train the algorithm, he said.

'Looking at all the things I'm seeing now, I realise data is like oil, and looking for data is a costly affair,' he said. 'If you can extract the right insight it can be a gold mine.'
