
Flare - value from unstructured documents

Friday, July 21, 2017

UK consultancy Flare Solutions is developing a range of methods to get value from unstructured documents - including working out when one document is similar to another, and helping people to find and manage the documents they need.

UK consultancy Flare Solutions is helping oil and gas companies get more understanding from unstructured text-based documents, and find new ways to locate what they are looking for.

The work involves using computers to analyse the raw text, looking for and recognising document analogues (where the content of a report seems similar to a 'standard' reference one).
The work is structured around probabilities, not absolutes, said Dave Camden, information management consultant with Flare Solutions. It shows you the most likely best answers, not just a 'yes' or 'no'.

This should make the system more robust than a system based on rules, like the 'expert systems' of the 1980s, which will easily collapse if the data behind them does not fit with one of the rules for some reason, he said.

As far as possible, the system aims to reduce cognitive bias, which can influence human decisions through pre-conceived ideas that are not always based on real facts or knowledge.

But when building up a knowledge base, for example by learning from existing text, you cannot entirely avoid cognitive bias since the text is written by people expressing their ideas and concepts. Choosing peer reviewed articles, looking across texts from multiple authors or choosing very large data sets will minimise the problem.

It is better if decisions can be made based on 'known facts', something people have collectively, rather than individually, shown to be true, he said.

'We have tried to strike a balance between using standard 'reference' values and relationships (known facts) and using information extracted from the target text (new facts that we can learn from),' Mr Camden said.

Analogues

Much of Flare's work is based around trying to help identify 'analogues', or where something is similar to something else.

As human beings, much of our understanding of how the world works is based around spotting patterns, where we have seen something similar to the event which is happening now.

Similarly, in geoscience, it can be helpful to find examples of wells, or rock type, from our past experience, which are similar to the wells we are working on now, he said.

The same approach could be used in reducing downtime on operational equipment. As (non-digital) human beings, we build up expertise about how things fail and indications that something might be about to fail. For example we learn that a small leak in a water pipe could indicate that the whole pipe is about to fail.

In the digital world, we can try to compare what is happening now with what has happened in the past, because equipment might fail by going through a similar sequence of events every time.

Inverse analogues can also be helpful, when you find something which is very different to what you would expect, based on what happened in the past. So maybe a computer analytics system can deliver these too, he said.

The human computer challenge

Computers are a long way from being able to understand much in the human world. As an example consider how hard it would be to explain a joke to a computer, he said.

Although you can program the computer to classify something as a 'joke', it could struggle with identifying similar jokes and would certainly not experience the joke as a human does.

Similarly, things in the real world can be very case specific - you could train a computer to understand when a certain component is about to fail, but the computer may not use that knowledge anywhere else.

This means that humans with domain expertise, or deep knowledge about a specific domain, will always be extremely important, because they will have to train the computer systems how to understand that domain, he said.

Explicit training, or training based on human-written texts, is currently a major component in making computers seem smart.

'Humans are passing knowledge onto systems and the systems then are learning from that and making inferences,' he said.

The CDA challenge

Common Data Access, an organisation which helps manage the UK's national oil and gas data, recently issued a technology challenge for companies to come and try to generate some value from the organisation's archive of exploration and production data and reports.

Flare participated in the challenge, working through about 25,000 unstructured data reports, mostly well documents for the North Sea.

Flare decided to look for 'analogues': parts of the North Sea which are very similar, but whose similarity was not previously known.

OCR

The first step of working with the 25,000 exploration and production reports was to scan them with Optical Character Recognition (OCR), so the computer could 'read' them.

There was a continuous learning process - getting to know alternative spellings, and spotting errors from the optical character recognition, he said.

OCR doesn't get everything completely correct, but 'many mistakes tend to be somewhat predictable, such as confusing zeros with noughts, ones and the letter l,' he said. Spell checkers, or comparing the words with a reference list of words, can pick out many OCR errors. 'We'll build a big knowledge base to enable us to do that.'
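The reference-list check described above can be sketched roughly as follows. This is a minimal illustration, not Flare's implementation: the confusion table, the function names and the tiny vocabulary are all assumptions made for the example.

```python
from itertools import product

# Hypothetical confusion table: characters that OCR commonly mixes up.
CONFUSABLE = {
    "0": "0O",
    "O": "O0",
    "1": "1lI",
    "l": "l1I",
    "I": "I1l",
}

def candidates(token):
    """Yield every spelling reachable by swapping confusable characters."""
    options = [CONFUSABLE.get(ch, ch) for ch in token]
    for combo in product(*options):
        yield "".join(combo)

def correct(token, vocabulary):
    """Return the first candidate found in the reference vocabulary,
    or the token unchanged if none of the candidates is a known word."""
    if token in vocabulary:
        return token
    for cand in candidates(token):
        if cand in vocabulary:
            return cand
    return token

# Illustrative reference list only; a real one would be far larger.
vocabulary = {"OIL", "shale", "well", "limestone"}
print(correct("0IL", vocabulary))    # OIL
print(correct("sha1e", vocabulary))  # shale
print(correct("we11", vocabulary))   # well
```

The number of candidates grows quickly for tokens with many confusable characters, so a production system would probably need a smarter lookup than this brute-force enumeration.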

OCR failures may add more 'noise' to the system, but are unlikely to create false matches, he said.

Another task was to remove so-called 'stop' words, like 'if' and 'but', which are commonly used in English but don't add anything to the technical understanding.
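As a rough sketch of that step, stop words can be filtered out after splitting the text into lower-case tokens. The stop-word list and function names below are illustrative assumptions, not the list Flare used.

```python
import re

# Illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"if", "but", "the", "a", "an", "and", "or", "of", "in", "is", "was", "to"}

def tokenize(text):
    """Split text into lower-case alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word list."""
    return [t for t in tokens if t not in stop_words]

tokens = tokenize("The shale is thin, but the sandstone is porous.")
print(remove_stop_words(tokens))  # ['shale', 'thin', 'sandstone', 'porous']
```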

Formation Analogues

This project was based on characterising geological formations and finding analogues based on the words (terms) that occurred around the formation names in the 25,000 documents.

This particular project did not start with a list of known formations. Instead, the formation names were extracted from the text (although a previous similar project did use an existing field list).

It set out to characterise each formation by a number of factors - the type of lithology, the age and depositional environment, taking this data automatically from the text.

Text analytics

The next stage was to analyse the text to see which words (terms) occur in proximity to each other (co-occurrence). There were hundreds of thousands of terms in the CDA text set.

'Words that occur together generally share a similar concept,' he said. 'The idea is that you can tell a word by the company it keeps. Words that are infrequent probably contain more information than words that are frequent.'

The outcome of this analysis is, for every term, a 'co-occurrence fingerprint' of 300 values. By comparing these fingerprints we can measure how similar terms are.
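The article does not say how the 300 values are produced, so the following is only a sketch of the general idea, under stated assumptions: it counts which terms occur within a small window of each other, gives rarer context terms more weight, and compares the resulting sparse fingerprints with cosine similarity. The window size, the weighting formula, the function names and the toy sentences are all invented for illustration, and no reduction to a fixed number of values is attempted.

```python
import math
from collections import Counter, defaultdict

def cooccurrence(sentences, window=2):
    """Count, for each term, the terms seen within `window` positions of it."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, term in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[term][tokens[j]] += 1
    return counts

def weight_rare(counts, sentences):
    """Scale each context count so that infrequent context terms count for more."""
    n = len(sentences)
    doc_freq = Counter(t for tokens in sentences for t in set(tokens))
    return {
        term: {ctx: c * math.log(1 + n / doc_freq[ctx]) for ctx, c in ctxs.items()}
        for term, ctxs in counts.items()
    }

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy, already-tokenised text with placeholder formation names.
sentences = [
    ["formation_a", "sandstone", "porous", "reservoir"],
    ["formation_b", "sandstone", "porous", "reservoir"],
    ["formation_c", "shale", "organic", "source"],
]
fingerprints = weight_rare(cooccurrence(sentences), sentences)
print(round(cosine(fingerprints["formation_a"], fingerprints["formation_b"]), 3))  # 1.0
print(cosine(fingerprints["formation_a"], fingerprints["formation_c"]))            # 0.0
```

Here the first two placeholder formations keep identical company, so their fingerprints match exactly, while the third shares no context terms with them.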

Some of the terms are the formation names we have extracted from the text, others are the lithology values that we 'know' from our reference knowledge base that also occur in the text. For each formation, we now compare its co-occurrence fingerprint with those of each of 200+ lithology terms, thereby creating a 'lithology fingerprint' of 200+ values for each formation.

To match formation based on lithology, we just look for the most similar 'lithology fingerprints' between a user-chosen formation and other formations.
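The two-stage comparison described above might look something like the sketch below. It assumes cosine similarity as the measure, which the article does not confirm, and uses invented three-value vectors and placeholder formation names in place of the real 300-value fingerprints.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length lists of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def lithology_fingerprint(formation_vec, lithology_vecs):
    """One similarity value per lithology term, in a fixed order."""
    return [cosine(formation_vec, vec) for vec in lithology_vecs]

def rank_analogues(target, fingerprints):
    """Rank all other formations by similarity to the target's fingerprint."""
    target_fp = fingerprints[target]
    scores = [
        (name, cosine(target_fp, fp))
        for name, fp in fingerprints.items()
        if name != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Invented co-occurrence fingerprints (3 values instead of 300).
formation_vectors = {
    "formation_a": [0.9, 0.1, 0.0],
    "formation_b": [0.8, 0.2, 0.1],
    "formation_c": [0.0, 0.1, 0.9],
}
lithology_vectors = {
    "sandstone": [1.0, 0.0, 0.0],
    "shale": [0.0, 0.0, 1.0],
    "limestone": [0.0, 1.0, 0.0],
}

lith_vecs = list(lithology_vectors.values())
lith_fps = {
    name: lithology_fingerprint(vec, lith_vecs)
    for name, vec in formation_vectors.items()
}
for name, score in rank_analogues("formation_a", lith_fps):
    print(name, round(score, 3))
```

With these toy numbers, formation_b is ranked as the closest lithological analogue of formation_a, and formation_c comes last.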

A similar process is carried out for other aspects of formations, such as geological age, depositional environment, production and problems.

The system can work in any (human) language, although if it is based on a certain knowledge base (such as oil and gas documents), the documents are likely to be mainly in one language (English in this case).

Search

This text analysis can then be used as a basis for more sophisticated search tools.

Flare showed a prototype search based on the same similarity methods used in the Formation Analogue system. The user can input one or more terms (for example, 'turbidite, shale, and tuff') and the system will respond with a ranked list of formations that best match those terms.
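One plausible way to build such a search on top of term fingerprints is sketched below. The article does not describe how the prototype combines several query terms, so averaging their fingerprints is an assumption made here, as are the function names, the placeholder formation names and the toy three-value vectors.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length lists of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def mean_vector(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def search(query_terms, term_vectors, formation_vectors, top_n=5):
    """Return up to top_n (formation, score) pairs, best match first.
    Query terms with no known fingerprint are ignored."""
    known = [term_vectors[t] for t in query_terms if t in term_vectors]
    if not known:
        return []
    query = mean_vector(known)
    scored = [(name, cosine(query, vec)) for name, vec in formation_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Invented fingerprints for illustration only.
term_vectors = {
    "turbidite": [1.0, 0.0, 0.0],
    "shale": [0.8, 0.2, 0.0],
    "tuff": [0.6, 0.4, 0.0],
}
formation_vectors = {
    "formation_a": [0.9, 0.2, 0.0],
    "formation_b": [0.0, 0.1, 1.0],
}

for name, score in search(["turbidite", "shale", "tuff"], term_vectors, formation_vectors):
    print(name, round(score, 3))
```

Because the result is a ranked list of scores rather than a yes-or-no match, it fits the probabilistic approach described earlier in the article.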

For the future, Flare is developing systems around graph databases (which show which terms are related to which other terms). 'That will give us a lot of capabilities in this space to do a lot of this kind of work,' he said.

Information management

Mr Camden's company, Flare, sees itself as an information management company, but 'we're looking to blur the boundaries between managing information and exploiting it,' he said, 'trying to link the information management world with people who consume the information.'

'Analytics is about trying to glean insight from information that's already out there,' he said.

'As information people, we've struggled for years with trying to make the thing relevant as far as the business end user is concerned. We see analytics as one way to achieve this.'


