February 4, 2016 was World Cancer Day, and February is National Cancer Prevention Month. Throughout the month, individuals and groups worldwide are writing about and sharing the importance of taking individual steps to reduce cancer risk, as well as the importance of cancer research at the clinical level.

Linguamatics are one of the pioneers in investing in Natural Language Processing (NLP) text mining technology to improve patient outcomes and cancer care. We have been working in healthcare for over 10 years, and recently announced a collaboration with Cancer Research UK to improve the characterization of cancer patient data for precision medicine.

NLP is growing rapidly in healthcare, not only for research; it is also now in widespread use for computer-aided coding and computer-aided document improvement. Simon Beaulah, our Director of Healthcare Strategy, has published a white paper on 9 ways Natural Language Processing is being used by scientists to improve our (actionable) understanding of cancer. It highlights how applying NLP can have a significant impact on cancer care by targeting the following areas:



The transition to new value-based payment models is spurring provider demand for technologies that enhance patient care and minimize safety risks, and in turn reduce costs. Of particular interest are tools to help providers predict the likelihood of potentially avoidable outcomes, such as a hospital readmission, pulmonary nodules turning cancerous or the contraction of sepsis.

According to a recent Linguamatics survey, most hospital CMIOs support the use of predictive models to improve the quality of care. In addition, CMIOs believe that these models can be enhanced with the use of Natural Language Processing (NLP) to access insightful data from unstructured chart notes.

Important Applications of Clinical NLP

The advent of accountable care, meaningful use, and the triple aim is creating an unprecedented demand for insightful patient data. Though structured data reveals valuable information, some 80% of EHR data resides in an unstructured narrative format. Furthermore, of the 1.2 billion clinical documents produced in the US each year, 60% of the valuable information exists in unstructured narrative documents that are largely inaccessible for data mining and quality measurement.

To gain better insight into patient data, providers might be inclined to expand their use of templates to capture discrete observations. Unfortunately, when purely coded templates take the place of free-text narratives, the resulting documentation often fails to capture subtle circumstances of a patient’s story. Frequently the patient narrative is the most effective means of communicating detailed information between healthcare professionals.

What alternatives do providers have for preserving the patient narrative while at the same time gaining additional insights from a patient’s complete medical record? One option is to tap into the power of Natural Language Processing (NLP) technology.

It was great to see our paper on last year's i2b2 NLP challenge published recently. The challenge looked at extraction of Coronary Artery Disease (CAD) risk factors from unstructured patient data provided by the Research Patient Data Repository of Partners Healthcare. Having worked through previous i2b2 challenges, such as smoking cessation, only after those competitions had closed, we wanted to participate actively in the 2014 NLP challenge and see how we compared with the other NLP groups in the competition. Linguamatics work with many academic medical centers and cancer centers and view collaboration as a key component of our customer relationships. As such, we wanted to share our success or failure with our peers and show how a commercial system can tackle these areas.

The i2b2 training set consisted of 790 annotated documents relating to 178 patients, which we decided to divide into training (70%) and development (30%) sets. The test set contained 514 documents from 118 patients. Contestants were set this task: extract CAD risk factors such as specific diseases (e.g. diabetes), medications, family history of CAD and lab results, taking into account when tests were carried out and whether a disease diagnosis was past or current.
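As a rough illustration of a 70/30 training/development split like the one described above, here is a minimal Python sketch. It assumes the split is made by patient, so that all documents for one patient land in the same set; the post does not say whether the split was done per patient or per document, so treat this as one reasonable option rather than the method actually used. The function name `split_by_patient`, its parameters and the document and patient identifiers are all hypothetical.

```python
import random
from collections import defaultdict


def split_by_patient(doc_patient_pairs, train_fraction=0.7, seed=0):
    """Split documents into training and development sets by patient.

    doc_patient_pairs: iterable of (doc_id, patient_id) tuples (hypothetical IDs).
    All documents belonging to one patient are kept in the same set.
    """
    # Group document IDs under the patient they belong to.
    docs_by_patient = defaultdict(list)
    for doc_id, patient_id in doc_patient_pairs:
        docs_by_patient[patient_id].append(doc_id)

    # Shuffle patients reproducibly, then cut the list at the requested fraction.
    patients = sorted(docs_by_patient)
    random.Random(seed).shuffle(patients)
    n_train = round(len(patients) * train_fraction)

    train_docs = [d for p in patients[:n_train] for d in docs_by_patient[p]]
    dev_docs = [d for p in patients[n_train:] for d in docs_by_patient[p]]
    return train_docs, dev_docs


# Toy example: 10 made-up patients with 3 documents each.
pairs = [(f"doc{p}_{i}", f"patient{p}") for p in range(10) for i in range(3)]
train_docs, dev_docs = split_by_patient(pairs)
print(len(train_docs), len(dev_docs))
```

Splitting by patient rather than by document avoids having notes from the same patient in both sets, which could make development scores look better than they should.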

Our team’s results were excellent and, at a Micro F-Score of 91.7%, were competitive with the best system in this challenge. As a rule-based system, I2E was better suited to the challenge than machine learning systems because:

Interest in big data in healthcare is expanding rapidly with the explosion in genomic data and adoption of electronic health records (EHR) resulting from the Affordable Care Act.

This data holds the promise of improved insights into patient outcomes, treatment effectiveness, patient satisfaction and population risk, which is why it is receiving so much attention.

Considerable focus is on how to integrate structured data within your organization, for example, to gain insights from lab data and disease coding, but this is just the first step.

A large proportion of healthcare data is still unstructured, in the form of documents, reports and images. These hold a significant amount of detailed patient data that is not captured, or is poorly captured, in structured data. This unstructured text from pathology, radiology and patient narratives captures the entire patient journey and is critical to understanding patient populations, assessing clinical risk and providing a better understanding of disease.

However, the format of the data poses significant challenges to its application and often results in laborious manual extraction to turn it into structured disease codes or specific data sets such as cancer registries. These manual processes do not scale to the level of discrete data required for analytics and outcomes analysis. How can this be addressed?