It was great to see our paper on last year's i2b2 NLP challenge published recently. The challenge looked at extracting Coronary Artery Disease (CAD) risk factors from unstructured patient data provided by the Research Patient Data Repository of Partners Healthcare. Having tackled previous i2b2 challenges, such as smoking cessation, only after they had closed, we wanted to participate actively in the 2014 NLP challenge and see how we compared against other NLP groups in the competition. Linguamatics works with many academic medical centers and cancer centers and views collaboration as a key component of our customer relationships. As such, we wanted to share our success or failure with our peers and show how a commercial system can tackle these areas.

The i2b2 training set consisted of 790 annotated documents relating to 178 patients, which we divided into training (70%) and development (30%) sets. The test set contained 514 documents from 118 patients. Contestants were set the task of extracting CAD risk factors such as specific diseases (e.g. diabetes), medications, family history of CAD and lab results, while also taking into account when tests were carried out and whether a disease diagnosis was historical or current.
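
As a rough illustration of that 70/30 partition (not our exact procedure; the file names and random seed below are hypothetical), the split can be reproduced in a few lines of Python:

```python
import random

# Hypothetical file names for the 790 annotated i2b2 training documents.
documents = [f"record-{i:03d}.xml" for i in range(790)]

random.seed(42)                      # fixed seed so the split is reproducible
random.shuffle(documents)

cut = int(len(documents) * 0.7)      # 70% training, 30% development
train_docs = documents[:cut]         # 553 documents
dev_docs = documents[cut:]           # 237 documents

print(len(train_docs), len(dev_docs))
```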

Our team’s results were excellent: at 91.7% micro-averaged F-score, we were competitive with the best systems in the challenge. As a rule-based system, I2E was well suited to this task compared with purely machine-learning approaches.
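
For readers less familiar with the metric, the micro-averaged F-score pools true positives (TP), false positives (FP) and false negatives (FN) across all risk-factor annotations before computing precision and recall; this is the standard definition, and the organisers' exact scoring script is not reproduced here:

\[
P = \frac{\sum_i TP_i}{\sum_i (TP_i + FP_i)}, \qquad
R = \frac{\sum_i TP_i}{\sum_i (TP_i + FN_i)}, \qquad
F_{\text{micro}} = \frac{2PR}{P + R}
\]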


Interest in big data in healthcare is expanding rapidly with the explosion in genomic data and adoption of electronic health records (EHR) resulting from the Affordable Care Act.

This data holds the promise of improved insights into patient outcomes, treatment effectiveness, patient satisfaction and population risk, which is why it is receiving so much attention.

Much of the current focus is on integrating structured data within the organization, for example to gain insight from lab data and disease coding, but this is just the first step.

A large proportion of healthcare data is still unstructured: documents, reports and images that hold a significant level of detail about patients that is not captured, or is poorly captured, in structured data. This unstructured text from pathology, radiology and patient narratives captures the entire patient journey and is critical to understanding patient populations, assessing clinical risk and providing a better understanding of disease.

However, the format of the data poses significant challenges to its use and often results in laborious manual extraction to turn it into structured disease codes or specific data sets such as cancer registries. These manual processes are not scalable to the level of discrete data required for analytics and outcomes analysis, so how can this be addressed?


Natural Language Processing (NLP) is a hot topic in healthcare.

At this year's AMIA Annual Symposium in Washington DC, we brought the discussion on clinical NLP to a roundtable held over Monday lunchtime, and were also invited to present some real-life clinical NLP use cases to the AMIA NLP workgroup.

However, as much as we like sharing what we're doing, we were keen to know what other people think about how NLP can transform patient care, today and in the future.

So that’s what we did: we asked our peers at the AMIA conference that question (how can NLP transform patient care?) as part of a contest, with an iPad Mini and a $50 Starbucks voucher as the incentives for first and second place respectively.

More than a third of entries identified mining the unstructured, free-text narrative of the medical record as crucial to transforming patient care. Unsurprising, really, if you consider that around 80% of the data in an electronic health record is unstructured, and the only practical way to get this information into a usable format is NLP.

But what was interesting was the variety of ideas for using this data: providing better patient information, feeding extracted coded concepts into clinical decision support, and retrieving complete patient cohorts.

It was a tough contest to judge, but the winning entry came from Edgar Chou at Drexel University College of Medicine. He had a few ideas, but the one we thought most interesting, with the potential to have the greatest impact on patient care, was around the payer care mix.


Rehospitalization is a serious problem in medicine.

Medical aspects are complicated by end-of-life care issues as well as a regulatory environment in which hospitals can face financial penalties for "excess" rehospitalization rates. Existing rehospitalization predictive models, most of which are based on administrative data, have poor statistical performance, as do models that employ limited physiologic data.

At Linguamatics' upcoming seminar in San Francisco, Dr. Escobar will present work on a new rehospitalization model that employs data from a comprehensive electronic medical record and could be instantiated in real time.

He will also present a "road map" for incorporating data from natural language processing into this model, as well as future strategies for embedding NLP engines in routine clinical operations.

Dr. Escobar is a research scientist at the Kaiser Permanente Division of Research in Oakland as well as being the Regional Director for Hospital Operations Research for Kaiser Permanente Northern California.

An expert on risk adjustment and predictive modeling, Dr. Escobar has published over 130 peer-reviewed articles and is currently deploying a real-time early warning system for deterioration outside the intensive care unit at two Kaiser Permanente hospitals.


IBM Watson gets a lot of attention in the medical field for trying to take the capabilities demonstrated on the Jeopardy! TV show and apply that cognitive reasoning to clinical care.

The complexity of disease, combined with the mass of medical literature and clinical guidelines, makes this high-dimensional problem an appropriate challenge for an industrial powerhouse.

However, what can be achieved using sophisticated Natural Language Processing (NLP) for information retrieval in clinical decision support should not be underestimated.

One of my favourite customer stories in recent years concerns our work with medical librarian Jonathan Hartmann from Dahlgren Memorial Library, the health sciences library at Georgetown University.

Jonathan’s role is to support the paediatrics and internal medicine teams on their rounds at Georgetown University Medical Center, giving them access to the latest medical insights and publications relating to the current patient.

For example, should a patient with metastatic renal cell carcinoma be given warfarin (an anticoagulant) for stroke prevention? Using his iPad at the bedside, Jonathan was able to quickly find journal articles indicating that cancer treatments, and potentially cancer spread, can indeed increase the risk of stroke.

You can read more about the story here.

From a technical perspective, the use of NLP in this scenario is well hidden, as it should be; it simply ensures that the right information is provided to assist clinical decision making.
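
For readers curious what such a bedside literature lookup might look like under the hood, here is a minimal sketch against the public PubMed E-utilities API; it is purely illustrative and is not the I2E query Jonathan used:

```python
import json
import urllib.parse
import urllib.request

# Illustrative only: a plain PubMed E-utilities search, not the I2E workflow described above.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

params = {
    "db": "pubmed",
    "term": "renal cell carcinoma AND warfarin AND stroke",
    "retmax": 10,
    "retmode": "json",
}

with urllib.request.urlopen(f"{ESEARCH}?{urllib.parse.urlencode(params)}") as resp:
    result = json.load(resp)

# Each returned PubMed ID can then be reviewed at the bedside.
for pmid in result["esearchresult"]["idlist"]:
    print(f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/")
```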