We are always enthused to read about new ways to utilize text mining in the drug discovery and development process, and very much enjoyed the recent paper by Heinemann et al., “Reflection of successful anticancer drug development processes in the literature”. In this study, the researchers develop tools that allow the prediction of the approval or failure of a targeted cancer drug, using models based on information mined from MEDLINE abstracts, along with a slew of other quantitative metadata (e.g. MeSH headings, author counts, fraction of authors with industry affiliation, and more). 

I2E, Linguamatics' text mining platform, enabled the researchers to systematically identify all MEDLINE abstracts containing both the protein target and the specific disease indication of a known set of approved or failed cancer therapeutics; for example, abstracts containing both Her2 and breast cancer, or c-Kit and gastrointestinal stromal tumor (GIST). I2E enables the use of large vocabularies or ontologies of genes and diseases to extract key information, and the researchers used I2E for the rapid retrieval of publications containing any one of the many synonyms of a protein target or indication.
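To make the co-occurrence idea concrete, here is a minimal sketch in plain Python. It is not I2E's API or the authors' pipeline; the synonym lists, the PMID-to-text dictionary, and the function names are all hypothetical, and a real vocabulary or ontology would be far larger.

```python
import re

# Hypothetical, tiny synonym lists standing in for a full gene/disease ontology.
TARGET_SYNONYMS = {
    "HER2": ["HER2", "ERBB2", "HER-2/neu"],
    "KIT": ["c-Kit", "CD117"],
}
INDICATION_SYNONYMS = {
    "breast cancer": ["breast cancer", "breast carcinoma"],
    "GIST": ["gastrointestinal stromal tumor", "GIST"],
}


def build_pattern(synonyms):
    """Compile one case-insensitive regex matching any synonym as a whole term."""
    # Longest synonyms first, so longer names win over their shorter prefixes.
    ordered = sorted(synonyms, key=len, reverse=True)
    alternatives = "|".join(re.escape(s) for s in ordered)
    return re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)


def find_cooccurring(abstracts, target, indication):
    """Return IDs of abstracts mentioning both the target and the indication."""
    target_re = build_pattern(TARGET_SYNONYMS[target])
    indication_re = build_pattern(INDICATION_SYNONYMS[indication])
    return [
        pmid
        for pmid, text in abstracts.items()
        if target_re.search(text) and indication_re.search(text)
    ]


# Made-up abstracts keyed by placeholder IDs.
abstracts = {
    "1": "ERBB2 amplification in breast carcinoma patients.",
    "2": "HER2 signalling in gastric cancer.",
    "3": "Imatinib inhibits c-Kit in gastrointestinal stromal tumor.",
}
her2_breast = find_cooccurring(abstracts, "HER2", "breast cancer")
kit_gist = find_cooccurring(abstracts, "KIT", "GIST")
```

Plain string matching like this is only a rough approximation: a case-insensitive match on a short synonym such as "GIST" would also hit the ordinary English word, which is one reason curated vocabularies and linguistic context matter in practice.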

The researchers found that the approved target-indication pairs showed a significantly higher publication count, starting 9 years before FDA approval, than the pairs that eventually failed.
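A comparison like this depends on aligning each pair's publications to a reference year. The sketch below shows one way to tabulate counts by years before that reference year; the function name, the 9-year window default, and the example years are assumptions for illustration, and the paper's actual statistical test is not reproduced here.

```python
from collections import Counter


def counts_by_offset(publication_years, reference_year, window=9):
    """Count publications at each offset (in years) before the reference year.

    Offsets outside the range 0..window are ignored.
    """
    offsets = Counter(reference_year - year for year in publication_years)
    return {offset: offsets.get(offset, 0) for offset in range(window, -1, -1)}


# Made-up publication years for one hypothetical target-indication pair
# with a reference (e.g. approval) year of 2009.
example = counts_by_offset([2000, 2000, 2005, 2009, 1990], reference_year=2009)
```

For approved pairs the reference year would naturally be the approval year; for failed pairs the analysis has to choose a comparable date, which is treated here as an open assumption.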

Taking the study further, they applied machine learning classifiers and found that the extracted data features could be used to predict success or failure of target-indication pairs, and hence, approved or failed drugs. They conclude:


During his January 2015 State of the Union speech, President Obama announced details of his administration’s Precision Medicine Initiative, which promises to accelerate the development of tools and therapies that are customized to individual patients. Precision medicine focuses on disease treatment and prevention and considers the variability in genes, environment, and lifestyle between individual patients.

Precision medicine takes into account healthcare’s relatively minor role in impacting a patient’s overall health and well-being, compared to the larger roles of genetics, health behaviors, and social and environmental factors. The precision medicine approach thus requires that providers have access to a wealth of patient-specific data. Thanks to advancements in genetic testing and new technologies, such as patient portals and remote monitoring devices, a wide variety of patient data is now readily available. Unfortunately, clinicians may have difficulty extracting data that is clinically relevant because much of the information is stored in an unstructured format.

Consider how a physician would glean information from a paper medical chart prior to EMRs. To understand a patient’s complete health status, the doctor would search through pages and pages of notes, an obviously time-consuming and error-prone task.


Drug safety is, of course, a prime focus of anyone in the pharmaceutical and biotech industries. The goal of any drug R&D project is to bring to market a safe and efficacious drug – oh yes, and ideally, create the latest blockbuster!

For monitoring drug safety, there are many tools and solutions. We were pleased to see the inclusion of Linguamatics I2E in a recent paper from the FDA on the “Use of data mining at the Food and Drug Administration” (Duggirala et al., 2016, J Am Med Inform Assoc).

This FDA review covers a very broad range of text and data mining approaches, across both FDA databases (e.g. MAUDE, VAERS) and external data such as MEDLINE, clinical study data, and social media.

Specifically, the FDA describe the use of I2E “to study clinical safety based on chemical structure information contained in medical literature. Linguamatics I2E enables custom searches using natural language processing to interpret unstructured text. The ability to predict the clinical safety of a drug based on chemical structures is becoming increasingly important, especially when adequate safety data are absent or equivocal.”


This Saturday, July 9, 2016, NLP-based text mining pioneers Linguamatics celebrate 15 years as the text analytics leader in the life science and healthcare markets.

Here are 15 things you may not know about Linguamatics:


Linguamatics recognized as Market Leader by analysts

(Cambridge, UK and Boston, USA – July 07, 2016) Text analytics provider Linguamatics is pleased to announce that it has been recognized by Frost & Sullivan with a 2016 Market Leadership Award. The award is the result of extensive research conducted by market analysis experts Frost & Sullivan on the “NLP (Natural Language Processing)-Based Text Mining for Life Sciences” industry.

This new research highlights Linguamatics’ leading position in the market:

“Linguamatics has achieved a leadership position in the NLP for text mining and analytics market…with few participants operating in this technology and domain that have as many use cases as Linguamatics.” Sangeetha Prabakaran, Research Manager, Transformational Health at Frost & Sullivan

Read the full report

In the pharmaceutical sector, Linguamatics customers use text mining to gain business insights across drug discovery and development, including gene-disease mapping and target identification, clinical trial optimizations, and competitive intelligence. As the report points out, the number of potential use cases is expanding considerably: