Extracting Coronary Artery Disease risk factors, results from the i2b2 2014 Challenge

September 23 2015

It was great to see our paper on the i2b2 NLP challenge from last year published recently. The challenge looked at extraction of Coronary Artery Disease risk factors from unstructured patient data provided by the Research Patient Data Repository of Partners Healthcare. Having done previous i2b2 challenges, such as smoking cessation, after the competition had closed, we wanted to actively participate in the 2014 NLP challenge and see how we compared against other NLP groups in the competition. Linguamatics work with many academic medical centers and cancer centers and view collaboration as a key component of our customer relationships. As such, we wanted to share our success or failure with our peers and show how a commercial system can tackle these areas.

The i2b2 training set consisted of 790 annotated documents relating to 178 patients, which we decided to divide into training (70%) and development (30%) sets. The test set contained 514 documents from 118 patients. Contestants were set this task: extract CAD risk factors such as specific diseases (e.g. diabetes), medications, family history of CAD and lab results; also take into account when tests were carried out or whether a disease diagnosis was in the past or current.
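As a rough illustration of the 70/30 division described above, here is a minimal sketch of one way to carve a development set out of the annotated documents. It assumes the split is made at the patient level, so that all documents for one patient land on the same side. That is our assumption for illustration, not a detail stated in this post. The function name `split_by_patient` and the synthetic document and patient identifiers are hypothetical.

```python
import random
from collections import defaultdict


def split_by_patient(doc_to_patient, dev_fraction=0.3, seed=0):
    """Split document IDs into (train, dev), keeping each patient's
    documents together. Patient-level grouping is an assumption here."""
    by_patient = defaultdict(list)
    for doc_id, patient_id in doc_to_patient.items():
        by_patient[patient_id].append(doc_id)

    # Shuffle patients reproducibly, then reserve a fraction for development.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    n_dev = round(len(patients) * dev_fraction)
    dev_patients = set(patients[:n_dev])

    train = sorted(d for p in patients if p not in dev_patients
                   for d in by_patient[p])
    dev = sorted(d for p in dev_patients for d in by_patient[p])
    return train, dev


# Synthetic stand-in for the 790 documents from 178 patients (made-up IDs).
doc_to_patient = {f"doc{i}": f"patient{i % 178}" for i in range(790)}
train_docs, dev_docs = split_by_patient(doc_to_patient)
print(len(train_docs), len(dev_docs))
```

Splitting by patient means the document-level ratio is only approximately 70/30, since patients have different numbers of records.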

Our team’s results were excellent and, at 91.7% Micro F-Score, were competitive with the best system in this challenge. I2E, being a rule-based system, was better suited to the challenge than machine learning systems because:

  • The annotations provided were heavily skewed towards certain risk factors. This made the task difficult for machine learning approaches, because the rarer risk factors had too few examples to train an accurate model. I2E’s interactive query development lets us take a data-driven approach instead, quickly iterating on queries and checking the results after each modification.

  • Named entity recognition was targeted at particular diseases and medications, rather than all diseases and medications, and was therefore very reliant on domain terminologies to classify the hits. Using domain ontologies to group and filter results is something that I2E does very well.
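For readers unfamiliar with the Micro F-Score quoted above, the sketch below shows the standard micro-averaged definition: true positives, false positives and false negatives are pooled across all categories before precision and recall are computed. The category names and counts are made up purely for illustration and are not challenge results. A macro average is included for contrast, since with skewed annotations the two can differ a lot.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; equals 2PR/(P+R). Returns 0.0 if undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(counts):
    """Pool counts over all categories, then compute a single F1."""
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    return f1(tp, fp, fn)


def macro_f1(counts):
    """Compute F1 per category, then take the unweighted mean."""
    scores = [f1(c["tp"], c["fp"], c["fn"]) for c in counts.values()]
    return sum(scores) / len(scores)


# Hypothetical, deliberately skewed counts (not taken from the challenge).
example_counts = {
    "medication": {"tp": 90, "fp": 5, "fn": 5},
    "family_history": {"tp": 2, "fp": 3, "fn": 3},
}

print(f"micro F1 = {micro_f1(example_counts):.3f}")
print(f"macro F1 = {macro_f1(example_counts):.3f}")
```

In this toy example the frequent category dominates the micro average, while the macro average is pulled down by the rare one.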

The team designed a query strategy that used I2E’s rule-based approach, took advantage of the annotated data where possible, and incorporated machine learning techniques where appropriate. For this reason we evaluated a number of machine learning approaches for time attribute classification. Because query development is iterative and response times are rapid, we can also take a data-driven approach to assessing unannotated data sets, so our method works well with both annotated and unannotated data. We can then build out a series of high-recall patterns based on the data and annotations, and refine them - see workflow below.

Our solution reached 86% accuracy after only 1.5 weeks of effort. It took just one more week of query refinement and work on edge cases to bring the score up to its final level. The work was done by a small team that included two new I2E users at Northwestern University, and none of the team were clinical domain experts. This challenge illustrated how I2E can empower NLP users to build new solutions quickly. We are very proud of the team and of the power of I2E - two weeks’ work by non-domain experts showed how I2E can have a major impact on patient assessment and care. Congratulations to the team for a job well done!