
NLP for Machine Learning in Healthcare

  1. Machine Learning Models need High Quality Data Sets
  2. NLP for Data Cleaning
  3. NLP for Data Extraction
  4. Building Machine Learning Pipelines using NLP
  5. Real World Examples of NLP for Machine Learning

Within Healthcare text (including EMR, claims, clinical notes, lab results, pathology reports, clinical trials and more) there is a huge amount of variation in the ways people express the same concepts or relationships. Data Scientists can use Natural Language Processing (NLP) to capture this variation and create a normalized representation of the text, cleaning the natural language (i.e., documents of text) into a structured, formal representation.

The benefits of this normalization include:

  • avoiding data sparsity (enabling language to be modelled more accurately)
  • improved document search
  • realizing connections across different kinds of notes written by different healthcare professionals
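
As a rough illustration of the normalization described above, the sketch below maps several surface forms onto one normalized concept. The synonym table and the `normalize_concepts` helper are hypothetical and hand-written for this example; they are not part of Linguamatics NLP, and a production system would rely on curated terminologies rather than a small dictionary.

```python
import re

# Hypothetical synonym table: each surface form maps to one normalized concept.
SYNONYMS = {
    "heart attack": "myocardial infarction",
    "mi": "myocardial infarction",
    "myocardial infarction": "myocardial infarction",
    "high blood pressure": "hypertension",
    "htn": "hypertension",
    "hypertension": "hypertension",
}

# Try longer phrases first so multi-word terms are preferred over short ones.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(term) for term in sorted(SYNONYMS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def normalize_concepts(text: str) -> list[str]:
    """Return the normalized concept for each synonym found, in order of appearance."""
    return [SYNONYMS[match.group(1).lower()] for match in _PATTERN.finditer(text)]


notes = [
    "Pt has HTN and prior heart attack.",
    "History of myocardial infarction; high blood pressure controlled.",
]
for note in notes:
    # Both notes print ['hypertension', 'myocardial infarction']
    print(sorted(set(normalize_concepts(note))))
```

Two differently worded notes end up with the same set of concepts, which is what reduces data sparsity and lets notes from different authors be connected.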

Deep learning models, such as transformer-based models, offer the possibility of end-to-end solutions that start from the unstructured text, but they require very large training sets, which are not always available.

Linguamatics NLP is a powerful text mining platform that can rapidly structure and normalize healthcare data. Using the Linguamatics NLP platform and I2E application can improve your Machine Learning (ML) pipeline by:

  • providing quality feature sets for ML models
  • saving time on open-source tool configuration
  • minimizing the time spent on data cleaning
  • maximizing the time spent on analyzing data

Machine Learning Models need High Quality Data

The pursuit of quality data sets brings many challenges when deploying an ML model. The quality of the data in the feature sets determines how well a model performs. Staying up to date with the churn of programming libraries and emerging tools can be a daunting task, and it takes time away from analyzing meaningful data.

Acquiring and cleaning data to attain the desired quality is also time-consuming. It is estimated that 80% of the healthcare industry’s effort is allocated to data cleaning, with only 20% going to generating insights.

“80% of data scientists' time is spent finding, cleaning and reorganizing data yet only 20% using it”

The Cognitive Coder (InfoWorld, Sept 2017)

NLP for Data Cleaning

Natural Language Processing (NLP) is used to convert unstructured text to a structured format: pulling out and normalizing relevant concepts, establishing their context (e.g., negated, hypothetical, or part of a family history), and identifying the relationships between them.
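
To make "establishing context" concrete, here is a deliberately simplified, cue-based sketch that labels a concept mention as negated, family history, or affirmed. The cue lists and the `classify_context` function are assumptions made for this example; this is not how I2E determines context, and real systems handle scope, hypotheticals, and many more cues.

```python
import re

# Illustrative cue lists (assumptions for this sketch, far from complete).
NEGATION_CUES = ("no", "denies", "without", "negative for")
FAMILY_CUES = ("mother", "father", "sister", "brother", "family history")


def classify_context(sentence: str, concept: str) -> str:
    """Label a concept mention as 'negated', 'family_history', 'affirmed', or 'absent'."""
    lowered = sentence.lower()
    position = lowered.find(concept.lower())
    if position == -1:
        return "absent"

    # Only look at the words that come before the concept in the sentence.
    preceding = lowered[:position]

    def has_cue(cues) -> bool:
        return any(re.search(r"\b" + re.escape(cue) + r"\b", preceding) for cue in cues)

    if has_cue(NEGATION_CUES):
        return "negated"
    if has_cue(FAMILY_CUES):
        return "family_history"
    return "affirmed"


for sentence, concept in [
    ("Patient denies chest pain.", "chest pain"),
    ("Mother had breast cancer.", "breast cancer"),
    ("Patient reports chest pain.", "chest pain"),
]:
    print(concept, "->", classify_context(sentence, concept))
```

Keeping the context label alongside the concept matters for ML features: a negated "chest pain" should not count as evidence of chest pain.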

Linguamatics I2E takes unstructured information and puts it through a rigorous cleaning process in a fine-grained NLP pipeline.

I2E can accept input documents in a wide variety of formats including plain text, csv/tsv/psv, xml, pdf, pptx/ppt, doc/docx, xls/xlsx, html, and in various compressed forms. Results are exportable in standard file formats such as JSON, XML, tsv/csv/psv and HTML.

As well as out-of-the-box workflows, the system is highly configurable using an intuitive GUI or REST APIs, allowing very specific information to be extracted, for example, information specific to a particular therapeutic area. The resulting features provide high-quality data sets for ML models.

Using I2E for ML models rebalances the time spent cleaning data and eliminates the need for elaborate or programming-intensive tool chains.

For more information about the NLP pipeline, see

NLP for Data Extraction

I2E is configured for new tasks using extraction strategies (queries) which are constructed semi-automatically using a data-driven approach.

The system is used to process large amounts of data to discover the most common constructions and terminology which can then be selected by the user as part of their query. For training data and datasets, this makes the identification of target variables and labels fast and efficient.

The quality of extraction is evaluated using a built-in tool for measuring the precision, recall and F-score against a human curated gold standard.
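
For readers unfamiliar with these measures, the sketch below shows how precision, recall and F-score (here the balanced F1) are computed against a gold standard. It does not reproduce I2E's built-in evaluation tool; the function name and the toy annotations are invented for illustration.

```python
def precision_recall_f1(predicted: set, gold: set) -> tuple[float, float, float]:
    """Compare extracted items with a human-curated gold standard."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data: (document id, extracted concept) pairs, invented for this example.
gold = {("doc1", "nausea"), ("doc1", "headache"), ("doc2", "rash"), ("doc3", "dizziness")}
predicted = {("doc1", "nausea"), ("doc2", "rash"), ("doc2", "fever")}

p, r, f = precision_recall_f1(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
# prints: precision=0.67 recall=0.50 F1=0.57
```

As a check on the formula, a precision of 0.94 and a recall of 0.98 give an F1 of about 0.96, which matches the figures quoted for the FDA example later on this page.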

Features can be based not just on concepts in context, but also on relationships, e.g., between a parameter and its value. For example, you can configure I2E to extract Body Mass Index (BMI) values, or to place these values into buckets of low/medium/high BMI.
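
The BMI example can be sketched in a few lines of plain Python. This is an illustration of the parameter-value idea only, not an I2E query: the regular expression is simplistic, and the bucket cut-offs (18.5 and 25.0) are assumptions chosen for the example rather than values taken from this page.

```python
import re
from typing import Optional

# Match "BMI" or "body mass index", then a short gap, then a numeric value.
BMI_PATTERN = re.compile(
    r"\b(?:BMI|body mass index)\b[^\d\n]{0,20}?(\d{1,2}(?:\.\d+)?)",
    re.IGNORECASE,
)


def extract_bmi(text: str) -> Optional[float]:
    """Return the first BMI value mentioned in the text, or None if there is none."""
    match = BMI_PATTERN.search(text)
    return float(match.group(1)) if match else None


def bucket_bmi(bmi: float, low_cutoff: float = 18.5, high_cutoff: float = 25.0) -> str:
    """Place a BMI value into a low/medium/high bucket (cut-offs are assumptions)."""
    if bmi < low_cutoff:
        return "low"
    if bmi < high_cutoff:
        return "medium"
    return "high"


for note in ["BMI: 31.4 kg/m2", "Body mass index is 22", "BMI of 17.9 at last visit"]:
    value = extract_bmi(note)
    print(note, "->", value, bucket_bmi(value))
```

Bucketing turns a free-text measurement into a categorical feature, which is often easier for a downstream model to use than the raw string.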

In predictive models, NLP is used not just for the features that go into the model, but also to create the training data for the model. For example, in building a model for opioid abuse prediction, we also used NLP to find the patients who had abused opioid drugs. Links to more examples of NLP in ML pipelines can be found at the bottom of the page.
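
The sketch below illustrates how extracted concepts could supply both the features and the training labels for a model. The row layout (`patient_id`, `concept`, `context`), the sample values, and the `build_dataset` helper are all hypothetical; they do not represent an actual I2E export format or the opioid model mentioned above.

```python
from collections import defaultdict

# Hypothetical extraction output: one row per (patient, normalized concept, context).
rows = [
    {"patient_id": "p1", "concept": "hypertension", "context": "affirmed"},
    {"patient_id": "p1", "concept": "opioid abuse", "context": "negated"},
    {"patient_id": "p2", "concept": "opioid abuse", "context": "affirmed"},
    {"patient_id": "p2", "concept": "chronic pain", "context": "affirmed"},
    {"patient_id": "p3", "concept": "chronic pain", "context": "affirmed"},
]


def build_dataset(rows, label_concept):
    """Pivot extraction rows into per-patient binary features and a binary label."""
    affirmed = defaultdict(set)
    patients = set()
    for row in rows:
        patients.add(row["patient_id"])
        # Negated or otherwise non-affirmed mentions do not count as evidence.
        if row["context"] == "affirmed":
            affirmed[row["patient_id"]].add(row["concept"])

    all_concepts = {concept for concepts in affirmed.values() for concept in concepts}
    # The label concept is kept out of the features to avoid leaking the answer.
    feature_names = sorted(all_concepts - {label_concept})
    patient_ids = sorted(patients)
    X = [[int(name in affirmed[pid]) for name in feature_names] for pid in patient_ids]
    y = [int(label_concept in affirmed[pid]) for pid in patient_ids]
    return patient_ids, feature_names, X, y


patient_ids, feature_names, X, y = build_dataset(rows, "opioid abuse")
print(feature_names)  # ['chronic pain', 'hypertension']
print(X)              # [[0, 1], [1, 0], [1, 0]]
print(y)              # [0, 1, 0]
```

The resulting `X` and `y` are plain lists, so they could be handed to whichever ML library a team already uses.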

Building Machine Learning Pipelines using NLP

The following diagram shows how a Data Scientist can use I2E to supplement a machine learning pipeline by systematically converting unstructured text into a structured set of features, combining the data cleaning and data extraction phases described above.

Similar to machine learning pipelines, queries are developed against training data and then evaluated against test data before being applied in production against live data.

[Figure: NLP machine learning (ML) workflow]

Real World Examples of NLP for Machine Learning

Data Scientists at leading organizations across healthcare are using NLP to feed machine learning:

  1. Kaiser Permanente Northern California: valvular heart disease population health management program
    Kaiser Permanente Northern California, a large, integrated healthcare system, used NLP to achieve positive and negative predictive values > 95% for identifying aortic stenosis and associated echocardiographic parameters. Their results showed that a validated NLP algorithm applied to a systemwide echocardiography database was substantially more accurate than diagnosis codes for identifying aortic stenosis. Their research showed that “leveraging machine learning–based approaches on unstructured electronic health record data can facilitate more effective individual and population management than using administrative data alone.”
  2. Kaiser Permanente prediction of gout flares
    Kaiser Permanente set out to identify gout flares automatically by applying natural language processing and machine learning to textual data from clinical notes. Using this process, they identified more gout flare cases (18,869 versus 7,861) and more patients with ≥3 flares (1,402 versus 516) than the claims-based method.
  3. FDA adverse event prediction
    Schotland et al. (2020) extracted features for Target-Adverse Event profiles (TAEs) from three key sources: adverse event (AE) reports, peer-reviewed literature, and FDA drug labels, each of which reports AEs for particular drugs. These features were fed into an ensemble machine learning model. By inference, these drugs are linked to their drug targets, which then enables a level of risk prediction for new drugs targeting the same protein. I2E was used to generate the Target-Adverse Event profiles across FDA drug labels, mapping each AE to MedDRA. The authors also analyzed the performance of the I2E text-mining query. Recall was increased with linguistic strategies such as morphological variants, spelling correction, and matching across conjunctions; precision was increased by using linguistic context and document regions. The final query had a recall of 0.98, a precision of 0.94, and an F1 score of 0.96 when tested on the 20 random drugs from the study that were used to train the query. When tested on 20 different random drugs from the study, it had a recall of 0.91, a precision of 0.90, and an F1 score of 0.90.
  4. Eli Lilly mining adverse event data to identify potential new uses for existing drugs
    The team at Eli Lilly used I2E to extract all the information needed from over 2,500 clinical trials. This included any serious adverse event (SAE) from randomized trials, along with study arm information (treatment, placebo, patient number), indication, trial description, and more. They then used PolyAnalyst (Megaputer), which provides access to a selection of machine learning algorithms, to calculate ranking statistics for the treatment-indication association.

    In the paper, the authors describe a number of drugs that this workflow revealed as candidates for repurposing in specific cancers. These include Telmisartan for colon cancer, Phylloquinone (vitamin K1) as a cancer preventative, and Aliskiren for gastric cancer.

  5. Johnson & Johnson Voice of the Customer (VoC) call feed categorization for building predictive models
    Johnson & Johnson uses Linguamatics NLP to annotate and categorize “voice of the customer” (VoC) call feeds, to gain insights into the real-world use of their drugs. Researchers in the Predictive Analytics group have built an end-to-end workflow to process the call transcripts, using agile text mining to make sense of the unstructured feeds. The calls are categorized and tagged for key metadata such as caller demographics and reason for calling (e.g., complaint, formulation information, side effect, drug–drug interactions). The extracted features are used as the structured substrate for machine learning algorithms, to assist in categorizing the call feeds and to build predictive models around the different products. Using the Linguamatics NLP platform in this workflow has more than doubled the efficiency in analysis; the accuracy of the NLP platform mining is at 95%, allowing the Medical Affairs teams to do longitudinal analysis of real-world patient outcomes.
  6. Roche mining MEDLINE abstracts to apply ML for prediction of success/failure of drugs in phase II or III with high accuracy  
    In a 2016 publication, researchers from Roche and Humboldt University of Berlin described how they used NLP to systematically identify all MEDLINE abstracts containing both the protein target and the specific disease indication of a known set of successfully approved or failed cancer therapeutics (for example, abstracts containing both “Her2” and “breast cancer,” or “c-Kit” and “gastrointestinal stromal tumor”). The researchers applied machine learning classifiers and found that the NLP-extracted data features could be used to predict success or failure of target-indication pairs, and hence, approved or failed drugs.
