The combined value of NLP and Machine Learning – a concrete example

With the rising costs of de novo drug discovery, and increasing focus on rare diseases, there is continuous innovation in methods and solutions to find new uses for existing drugs. I was interested to hear of a novel approach for this, published recently by Eric Su and Todd Sanger at Eli Lilly. In this paper, “Systematic drug repositioning through mining adverse event data in ClinicalTrials.gov”, the authors describe the combined use of Natural Language Processing (NLP) and Machine Learning (ML) to extract potential new uses of existing drugs.

It’s quite astonishing how often in the last weeks and months I’ve been asked about the interplay between NLP, Artificial Intelligence (AI), and ML. It seems that everyone wants to understand the real potential of these tools (rather than the hype being shouted from the rooftops) to impact healthcare, research, and many other areas of our lives over the next decade.

So, let’s delve further into this concrete example of the combined value of NLP and ML. The innovative step here was to exclude trials for a specific indication, such as cancer, and then find trials with Serious Adverse Events (SAEs) classified as cancerous. The researchers then compared the arms of each trial: if the placebo arm had more cancer-related SAEs than the treatment arm, they hypothesized that the treatment has a positive anti-cancer effect.
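The arm-comparison logic described above can be sketched in a few lines. This is a minimal illustration only: the trial record structure, field names, and counts below are invented for the example, and the published method involves far more careful statistical treatment than a simple rate comparison.

```python
# Hedged sketch of the placebo-vs-treatment SAE comparison described above.
# The record layout and numbers are illustrative, not taken from the paper.

def cancer_sae_rate(sae_count, n_subjects):
    """Rate of cancer-classified serious adverse events in one trial arm."""
    return sae_count / n_subjects

def flags_repositioning_candidate(trial):
    """True if the placebo arm shows a higher rate of cancer-related SAEs
    than the treatment arm -- the signal the authors hypothesize may
    indicate an anti-cancer effect of the treatment."""
    placebo_rate = cancer_sae_rate(trial["placebo_saes"], trial["placebo_n"])
    treatment_rate = cancer_sae_rate(trial["treatment_saes"], trial["treatment_n"])
    return placebo_rate > treatment_rate

# Illustrative trial record (values invented for the example)
trial = {"placebo_saes": 9, "placebo_n": 200,
         "treatment_saes": 2, "treatment_n": 205}
print(flags_repositioning_candidate(trial))  # True for this made-up record
```

In practice one would also need a significance test across many trials before treating such a signal as evidence, which is where the ML side of the published workflow comes in.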


What are the challenges facing life science and healthcare organisations, where text analytics can play a part? This is one of the key questions that I ask myself and others regularly. There is so much buzz at the moment around big data, real-world data, healthcare informatics, and wearables; but what is really working, and what is just hype?

One of the ways we get input on this question is, of course, meeting our customers and hearing about their successes. Linguamatics hosts two user group meetings every year, and our European Spring Text Mining Conference is coming up rapidly. Held over 3 days in April, the conference gives scientists and clinicians interested in text mining the opportunity to attend hands-on training workshops, round table discussions, and a day of talks from both Linguamatics staff and our customers.

This year, our customer speakers encompass a wide range of use cases, spanning the pipeline of discovery, development, and delivery of therapeutics:


Clinical Trials text mining can speed key decisions, effective site selection and trial design 

Clinical trials form the cornerstone of evidence-based medicine, and are essential to establishing the safety and efficacy of new drugs. Each new drug, before being approved by regulatory agencies, must pass through a set of gates. At the very basic level these include phase 1 for first-in-human safety; phase 2 for efficacy and biological activity against the target; and phase 3 for safety, efficacy and effectiveness of the new therapeutic.

At each of these phases, careful planning is essential for a successful study. The clinical study protocol covers objective(s), design, methodology, statistical considerations and organization of a clinical trial, and ensures the safety of the trial subjects and integrity of the data collected.

Over recent years, clinical trial designs and procedures have become more diverse and more complex. The impact of precision medicine means trials have to be more carefully planned to ensure adequate statistical power for smaller patient groups, and adaptive, umbrella, basket and n-of-1 trials are now more frequent.

The regulatory requirements and growing complexity of clinical trials translate into more numerous and more complex eligibility criteria for study enrolment, increased site visits and required procedures, longer study duration, and more rigorous data collection requirements. (From: PhRMA Biopharmaceutical Industry Profile 2016)


At the end of 2016, I attended the CBI 2nd Annual IDMP Update Forum in Philadelphia, a small but highly focused and effective conference with two days of meetings and discussions. There were presentations by industry leaders involved in understanding and addressing the challenges that IDMP compliance presents to the pharmaceutical industry, and also presentations by some of the vendors in this area.

The meeting kicked off with a keynote from John Kiser, Senior Director, Regulatory Policy and Intelligence, AbbVie.  This brought up some of the key challenges for IDMP compliance that were repeated again and again across the conference:

  • We need to think strategically, about master data management (MDM), not just about what is needed for IDMP compliance.
  • Even though the timelines are moving out, it’s really important not to take our eye off the ball. IDMP projects are being driven out of the EU, and the US has to get moving to keep up. Don’t wait: start planning and kicking off pilots and proof-of-concepts with vendors now.
  • IDMP compliance planning shouldn’t just involve regulatory affairs and supply chain departments, as IDMP will also impact quality, clinical operations, pharmacovigilance and safety, production, IT and more.

How text analytics using I2E can help

One comment that interested me was that while manual curation may provide the data elements for the current understanding of Iteration 1, other strategies will be needed to deal with potential changes in the implementation guidance, and to accommodate the flexibility required for Iteration 2 and beyond.


Uncovering new toxicities from chronic non-rodent studies

Preclinical toxicology studies are an essential part of the drug discovery-development pipeline, to support the safe conduct of clinical trials. And drug safety is, of course, one of the most critical aspects to ensure during drug development.

We were pleased to see the recent publication by Merck on a text-mining approach to assess the value of chronic non-rodent toxicology studies. 

Preclinical safety assessment groups employ a variety of animal models and assays to satisfy regulatory agency requirements: to identify and characterize drug toxicities, describe drug exposures, and provide qualitative and quantitative risk assessments for human exposure. These studies require considerable resource investment; however, the results are often “locked away” in internal reports. This means re-use of these valuable data is difficult and costly.

This is a common situation within the pharmaceutical industry, where critical information is locked away in textual reports, such as the informed scientific conclusions of pathologists, histologists, and safety experts. Natural language processing can overcome these barriers by extracting structured facts from unstructured documents, and Merck’s paper describes an evaluation of a text mining workflow to access these important data.
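To make “extracting structured facts from unstructured text” concrete, here is a deliberately tiny sketch of the general idea. It is not how I2E works — production systems use linguistic parsing, terminologies and ontologies rather than a single regular expression — and the pattern, field names and example sentence below are invented for illustration.

```python
import re

# Toy illustration: turn a free-text preclinical finding into a structured
# record. Real text-mining platforms (e.g. I2E) use linguistic analysis and
# ontologies; this single regex is only a sketch of the concept.
PATTERN = re.compile(
    r"(?P<severity>minimal|mild|moderate|marked)\s+"   # severity grade
    r"(?P<finding>\w+)\s+"                             # pathology finding
    r".*?in the\s+(?P<organ>\w+)",                     # affected organ
    re.IGNORECASE,
)

def extract_finding(sentence):
    """Return a dict of severity/finding/organ, or None if no match."""
    match = PATTERN.search(sentence)
    return match.groupdict() if match else None

text = "Moderate hypertrophy was observed in the liver of high-dose animals."
print(extract_finding(text))
```

Once findings like these are in structured form, they can be aggregated across hundreds of historical study reports — which is exactly the kind of re-use of “locked away” data that the Merck evaluation addresses.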