University of Iowa Uses NLP to Improve Phenotype Extraction for Precision Medicine

June 4, 2019

Precision medicine focuses on disease treatment and prevention, both at the clinical level (in healthcare organisations) and within drug discovery and development (in pharma companies). Treatments are developed and delivered taking into account the variability in genes, environment, and lifestyle between individual patients. Within the clinical arena, understanding the best treatment pathway for a particular patient or group of patients requires being able to access and analyze detailed information from patients’ medical records, and ideally broader aspects beyond their medical history.

A great example of precision medicine within the clinical arena was presented at the Linguamatics seminar in Chicago in March 2019. At the University of Iowa, scientists at the Stead Family Children’s Hospital are working on a precision medicine research project. Alyssa Hahn (Graduate Student, Genetics) described how they are using Linguamatics Natural Language Processing (NLP) to extract phenotype details from electronic medical records of patients with suspected genetic disorders.

Chromosomal microarray testing can help diagnose suspected genetic disease

First, some background. Over 700 patients undergo chromosomal microarray (CMA) testing at the University of Iowa each year. CMA is a genomic-scale clinical test that detects deletions and duplications of genomic material in the DNA of patients suspected of having a genetic disorder. Classification of CMA results into Normal, Abnormal or VUS (variant of unclear clinical significance) depends heavily on manual chart review and subjective determination of the relevance of the genetic variant found to the clinical phenotype.

Using current standard-of-care practices, about 40% of these pediatric patients will receive a test result of VUS. This result essentially boils down to “we found something in your genome that is not normal, but we don’t have enough information to know if it is contributing to your condition or not.” This level of uncertainty can be very unsettling to families, and it means that further clinical testing is needed for a genetic diagnosis.

Comprehensively phenotyping these patients improves diagnostic yield

On average, the CMA genetic testing referral form contains only three terms to describe the patient phenotype. Alyssa and the team wanted to see whether NLP could provide a more comprehensive phenotype landscape from more detailed patient records, and thereby improve the diagnostic yield of CMA testing. Alyssa gave one example where the information from the CMA referral form resulted in a VUS classification; once more phenotype information had been found in the medical notes, the patient was diagnosed with Verheij syndrome (a rare microdeletion syndrome of chromosome 8q24.3, a region that harbours the PUF60, SCRIB, and NRBP2 genes).

The team decided to use the Human Phenotype Ontology (HPO). HPO covers over 13,000 clinically relevant phenotypic abnormalities in human disease, and provides a standardized vocabulary for describing them, which enables comparisons and computational analysis across patients and diseases.

Building a gold standard: challenges of Epic EHRs and manual curation

At the start of the project, Alyssa needed to build a gold standard set of manually annotated records, to use for developing NLP queries and for checking precision and recall. The first challenge was to extract the data from Epic. The record structure within Epic is lost on export, so the resulting block of text had to be reformatted to make it readable and usable in the Linguamatics I2E application. The team selected a cohort of 30 patients for manual review to find HPO terms. The inter-annotator agreement was low across the four annotators (see figure), but these annotations allowed the I2E queries to be developed iteratively, refining and targeting the search strategies to reduce both false positives and false negatives.

Inter-annotator metrics for the gold standard development. The bar chart on the left shows the number of HPO terms found by one, two, three or all four annotators, and the Venn diagram on the right gives a more detailed view of these results. Only 9% of phenotypes were found by all four annotators, and a surprising 56% of phenotype terms were found by only one annotator.
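The tally behind a chart like this can be sketched in a few lines of Python. This is a minimal illustration only, not the team's actual analysis: the annotator names and phenotype labels below are hypothetical placeholders, and plain text labels stand in for HPO identifiers.

```python
from collections import Counter


def agreement_distribution(annotations):
    """Return {k: number of distinct terms found by exactly k annotators}."""
    term_counts = Counter()
    for terms in annotations.values():
        # set() guards against an annotator listing the same term twice
        term_counts.update(set(terms))
    distribution = Counter(term_counts.values())
    n_annotators = len(annotations)
    return {k: distribution.get(k, 0) for k in range(1, n_annotators + 1)}


# Hypothetical annotations for one record (labels are illustrative only)
annotations = {
    "annotator_1": {"seizure", "global developmental delay", "microcephaly"},
    "annotator_2": {"seizure", "global developmental delay"},
    "annotator_3": {"seizure", "hearing impairment"},
    "annotator_4": {"seizure", "abnormal facial shape"},
}

distribution = agreement_distribution(annotations)
total_terms = sum(distribution.values())
for k, count in distribution.items():
    print(f"found by {k} annotator(s): {count} terms ({count / total_terms:.0%})")
```

In this toy record, three of the five distinct terms are found by a single annotator and only one by all four, which mirrors the kind of skew described in the caption.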

NLP hugely increases patient phenotypes for genetic diagnoses

Alyssa described the iterative query development process possible with Linguamatics I2E, to extract HPO abnormal phenotype terms. Once the queries were ready, the larger set of records was processed, covering around 3,500 pediatric patients from 2012 to 2018. The results were impressive. On average, 1.9 phenotype terms were found on the CMA referral form; manual curation of the patient records found an average of 29.1 terms; and NLP found an average of 71.5 HPO terms. This provides a more detailed view of patients, which can improve understanding of rare diseases (see figure below comparing CMA forms to I2E extraction from EMRs).

Number of HPO terms per individual. The left-hand pie chart shows the number of phenotypic terms provided on the CMA referral forms; the right-hand pie chart shows the number extracted from the patient medical records using Linguamatics NLP, for the records analysed from 3,375 patients.

The precision of Linguamatics NLP was 91.65%, compared to the manually curated gold standard.

- Alyssa Hahn, Graduate Student, Genetics

In addition, I2E was able to recover over 1,000 HPO terms which were not found by the manual curators. Of these, 92% were determined to be correct phenotypes of the patients.
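For readers less familiar with the metrics, precision and recall against a gold standard can be computed as in the following minimal sketch. The function name and the example term sets are assumptions made for illustration; they are not taken from the project or from I2E.

```python
def precision_recall(predicted, gold):
    """Compare a set of extracted terms with a gold standard set of terms."""
    true_positives = len(predicted & gold)
    # Precision: share of extracted terms that are in the gold standard
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of gold standard terms that were extracted
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical terms for one patient (labels are illustrative only)
gold = {"seizure", "microcephaly", "hypotonia"}
predicted = {"seizure", "microcephaly", "scoliosis", "hearing impairment"}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Note that any term the NLP finds but the curators missed counts against precision in this calculation, even if it is a genuine phenotype. That is why a separate review of the extra terms, as described above, is needed to judge them fairly.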

The team also looked at the cost savings. The time taken to extract phenotypes manually for 100 patients would be just over 34 hours, compared to 10 minutes using I2E. Extrapolating this to the 700 CMAs run each year at the University of Iowa Stead Family Children’s Hospital translates to nearly 240 hours manually vs. just 1.2 hours using natural language processing.
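The extrapolation is a simple linear scaling, sketched below. It assumes the per-patient time stays constant as the cohort grows, and it rounds "just over 34 hours" down to 34, so the manual total comes out slightly below the quoted figure.

```python
MANUAL_HOURS_PER_100 = 34.0   # "just over 34 hours" in the text; rounded down here
NLP_MINUTES_PER_100 = 10.0    # I2E time for 100 patients
PATIENTS_PER_YEAR = 700       # CMAs run per year


def scale(per_100, patients):
    """Linearly scale a per-100-patients figure to a given number of patients."""
    return per_100 * patients / 100


manual_hours = scale(MANUAL_HOURS_PER_100, PATIENTS_PER_YEAR)
nlp_hours = scale(NLP_MINUTES_PER_100, PATIENTS_PER_YEAR) / 60  # minutes to hours

print(f"manual: {manual_hours:.0f} hours, NLP: {nlp_hours:.1f} hours")
```

This gives 238 hours of manual review against roughly 1.2 hours (70 minutes) with NLP, in line with the figures quoted above.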

This exciting precision medicine project is still ongoing, and Alyssa outlined some of the next steps. The outcome of this work has the potential to aid the interpretation of clinical chromosomal microarrays, and to directly improve the diagnosis and clinical care of hundreds of patients seen at the Stead Family Children’s Hospital.

