The role of unstructured data in the big data healthcare debate

March 23 2015

Interest in big data in healthcare is expanding rapidly, driven by the explosion in genomic data and by the adoption of electronic health records (EHRs), which federal incentives under the HITECH Act and the Affordable Care Act have encouraged.

This data holds the promise of improved insights into patient outcomes, treatment effectiveness, patient satisfaction and population risk, which is why it is receiving so much attention.

Much of the focus is on how to integrate structured data within an organization, for example to gain insights from lab data and disease coding, but this is only the first step.

A large proportion of healthcare data is still unstructured, held in documents, reports and images that contain detailed patient information that is either not captured or poorly captured in structured data. The unstructured text in pathology reports, radiology reports and patient narratives spans the entire patient journey and is critical to understanding patient populations, assessing clinical risk and building a better understanding of disease.

However, the format of this data poses significant challenges to its use, and often results in laborious manual extraction to turn it into structured disease codes or specific data sets such as cancer registries. These manual processes do not scale to the volume of discrete data required for analytics and outcomes analysis. How can this be addressed?

Natural Language Processing, or NLP, enables unstructured text to be translated into discrete data fields by identifying the key concepts and their relationships in healthcare documentation. Concepts such as TNM cancer stage, ambulatory status, social support network and ejection fraction can be identified in text and delivered as structured fields.
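As a rough illustration of what "text in, structured fields out" can mean, here is a minimal Python sketch that pulls an ejection fraction and a TNM stage out of a sentence. The patterns, the function name `extract_fields` and the output field names are assumptions made up for this example; real clinical NLP relies on far richer linguistic analysis than two regular expressions.

```python
import re

# Hypothetical patterns for illustration only; real notes vary far more than this.
EF_PATTERN = re.compile(
    r"\b(?:ejection fraction|LVEF|EF)\b[^0-9%]{0,20}(\d{1,2})(?:\s*-\s*(\d{1,2}))?\s*%",
    re.IGNORECASE,
)
TNM_PATTERN = re.compile(r"\b[cp]?(T[0-4Xx][a-c]?)\s*(N[0-3Xx][a-c]?)\s*(M[01Xx])\b")


def extract_fields(text):
    """Return a dict of structured fields found in free text (empty if none)."""
    fields = {}
    ef = EF_PATTERN.search(text)
    if ef:
        low = int(ef.group(1))
        # A range such as "35-40%" is reduced to its midpoint.
        high = int(ef.group(2)) if ef.group(2) else low
        fields["ejection_fraction_pct"] = (low + high) / 2
    tnm = TNM_PATTERN.search(text)
    if tnm:
        fields["t_stage"], fields["n_stage"], fields["m_stage"] = tnm.groups()
    return fields


note = "Echo shows LVEF of 35-40%. Pathology: pT2 N1 M0 adenocarcinoma."
print(extract_fields(note))
```

Once the values are in named fields like these, they can be loaded alongside lab data and disease codes rather than staying locked in the report text.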

So far, the use of NLP in clinical IT has been oriented towards disease coding and clinical documentation improvement, but it needs to be extended into new areas to provide the fuel for big data analytics and population health. We no longer need to limit our analytics to data that is already structured; we can use NLP to provide the data we need, making it possible to test new hypotheses much faster.

To analyze clinical and other patient-related data effectively, simply spotting a disease name, for example, is not enough. NLP establishes the context: whether the disease was ruled out, or whether the mention refers to a family history of the disease rather than a diagnosis for this patient. Semantic normalization is also key when analyzing unstructured text.
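To make the point about context concrete, the following sketch labels a disease mention as affirmed, negated or family history using simple trigger phrases. The trigger lists and the function name `classify_mention` are assumptions for illustration, and the approach is much cruder than established rule-based algorithms such as NegEx; it ignores scope, so a trigger anywhere in the sentence affects the result.

```python
import re

# Hypothetical trigger phrases; a real system would use a much larger, validated list.
NEGATION = re.compile(r"\b(no evidence of|ruled out|negative for|denies|no|without)\b")
FAMILY = re.compile(r"\b(family history|mother|father|sister|brother)\b")


def classify_mention(sentence, concept):
    """Label how a concept is mentioned within a single sentence."""
    s = sentence.lower()
    if concept.lower() not in s:
        return "absent"
    if FAMILY.search(s):
        return "family_history"
    if NEGATION.search(s):
        return "negated"
    return "affirmed"


for text, concept in [
    ("Patient diagnosed with diabetes in 2010.", "diabetes"),
    ("Pneumonia was ruled out.", "pneumonia"),
    ("Mother had breast cancer.", "breast cancer"),
]:
    print(concept, "->", classify_mention(text, concept))
```

Only the first of these three mentions should count as a diagnosis for the patient, which is exactly the distinction a plain keyword search would miss.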

Knowing that Crestor is the brand name of rosuvastatin calcium, which is an HMG-CoA reductase inhibitor, or statin, is vital. Such normalization ensures that there is enough consistent data for analytics to be applied, avoiding data sparsity. New sources of unstructured data are also becoming available from patient-reported outcomes and social media, and this information can be monitored to see whether it helps predict risk or outcomes. NLP in the big data age therefore needs, above all, to be flexible enough to handle the variety of sources that now make up the patient experience. Big data will be used far more effectively when unstructured data is put to work as part of a wider strategy.
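The Crestor example above can be sketched as a two-step lookup, from brand name to ingredient and from ingredient to drug class. The dictionaries and the function name `normalize_drug` are hand-written assumptions for this illustration; a production system would draw on a curated drug terminology rather than a hard-coded table.

```python
from collections import Counter

# Hypothetical, hand-written lookups for illustration only.
BRAND_TO_INGREDIENT = {
    "crestor": "rosuvastatin calcium",
    "lipitor": "atorvastatin calcium",
    "zocor": "simvastatin",
}
INGREDIENT_TO_CLASS = {
    "rosuvastatin calcium": "HMG-CoA reductase inhibitor (statin)",
    "atorvastatin calcium": "HMG-CoA reductase inhibitor (statin)",
    "simvastatin": "HMG-CoA reductase inhibitor (statin)",
}


def normalize_drug(mention):
    """Map a drug mention to its ingredient and class, or None if unknown."""
    key = mention.strip().lower()
    # Fall back to the mention itself so generic names are accepted too.
    ingredient = BRAND_TO_INGREDIENT.get(key, key)
    drug_class = INGREDIENT_TO_CLASS.get(ingredient)
    if drug_class is None:
        return None
    return {"ingredient": ingredient, "drug_class": drug_class}


mentions = ["Crestor", "lipitor ", "simvastatin", "aspirin"]
normalized = [normalize_drug(m) for m in mentions]
class_counts = Counter(n["drug_class"] for n in normalized if n)
print(class_counts)
```

Three differently worded mentions collapse into a single drug class here, which is how normalization reduces sparsity: the analytics see one well-populated category instead of several thinly populated ones.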

NLP plays a key role in turning unstructured text into usable insights and will need to be better incorporated into big data architectures in the future to deliver on its potential.