Drug safety and pharmacovigilance are critical aspects of drug development. To understand and monitor the potential risks of pharmaceuticals, researchers use many different strategies to uncover real-world evidence, such as reports of adverse events and patient-reported outcomes.

At the upcoming Linguamatics Text Mining Summit, there are three talks on text mining strategies that improve our understanding of drug-related adverse reactions.

Nina Mian from AstraZeneca will present research on text mining adverse event data from two sources: FDA drug labels (derived from clinical trial data) and real-world data from PatientsLikeMe. Eric Lewis from GSK will discuss applications of I2E for clinical safety and pharmacovigilance – particularly the problems of identifying potential “new signals” and distinguishing signal from noise. Stuart Murray from Agios will present workflows for automated identification of potential drug safety events.

These talks, from industry specialists, demonstrate the value of text mining to access and understand the complex world of drug safety and safety signals.


We are always enthused to read about new ways to utilize text mining in the drug discovery and development process, and very much enjoyed the recent paper by Heinemann et al., “Reflection of successful anticancer drug development processes in the literature”. In this study, the researchers develop tools that allow the prediction of the approval or failure of a targeted cancer drug, using models based on information mined from MEDLINE abstracts, along with a slew of other quantitative metadata (e.g. MeSH headings, author counts, fraction of authors with industry affiliation, and more). 

I2E, Linguamatics' text mining platform, enabled the researchers to systematically identify all MEDLINE abstracts containing both the protein target and the specific disease indication of a known set of successfully approved or failed cancer therapeutics; for example, abstracts containing both Her2 and breast cancer, or c-Kit and gastrointestinal stromal tumor (GIST). I2E enables the use of large vocabularies or ontologies of genes and diseases to extract key information, and the researchers used I2E for the rapid retrieval of publications containing any one of the many synonyms of a protein target or indication.
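To make the idea of synonym-aware co-occurrence concrete, here is a minimal Python sketch. It is not I2E and not the authors' method: the tiny synonym lists and the function names `compile_terms` and `cooccurring_pairs` are assumptions made up for illustration, and a real vocabulary or ontology would be far larger.

```python
import re

# Hypothetical, deliberately tiny synonym lists (illustration only).
TARGET_SYNONYMS = {"HER2": ["HER2", "ERBB2", "HER-2/neu"]}
INDICATION_SYNONYMS = {
    "breast cancer": ["breast cancer", "breast carcinoma", "mammary carcinoma"],
}


def compile_terms(synonyms):
    """Build one case-insensitive pattern that matches any synonym as a whole term."""
    # Longest synonyms first, so a longer term wins over one of its prefixes.
    ordered = sorted(synonyms, key=len, reverse=True)
    body = "|".join(re.escape(s) for s in ordered)
    return re.compile(r"(?<!\w)(?:" + body + r")(?!\w)", re.IGNORECASE)


def cooccurring_pairs(abstract, targets, indications):
    """Return (target, indication) pairs whose synonyms both occur in the abstract."""
    found = []
    for target, target_syns in targets.items():
        if not compile_terms(target_syns).search(abstract):
            continue
        for indication, indication_syns in indications.items():
            if compile_terms(indication_syns).search(abstract):
                found.append((target, indication))
    return found


hit = cooccurring_pairs(
    "ERBB2 amplification is common in mammary carcinoma.",
    TARGET_SYNONYMS,
    INDICATION_SYNONYMS,
)
miss = cooccurring_pairs(
    "c-Kit mutations drive GIST.", TARGET_SYNONYMS, INDICATION_SYNONYMS
)
```

In this sketch the first abstract is reported under the preferred names HER2 and breast cancer even though neither string appears in the text, which is the point of searching with a vocabulary rather than a single keyword.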

The researchers found that the set of approved target-indication pairs showed a significantly higher publication count, from 9 years before FDA approval, compared to the eventually-failing pairs. 
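A comparison like this rests on counting publications per year relative to a reference date. The short Python sketch below shows one way such counts could be tabulated; it is an assumption-laden illustration, not the paper's analysis. The function name `counts_relative_to_reference` and the example years are hypothetical, and how a reference year is chosen for failed pairs is left open here.

```python
from collections import Counter


def counts_relative_to_reference(pub_years, reference_year, window=9):
    """Count publications per year offset, from -window up to 0 (the reference year)."""
    counts = Counter(year - reference_year for year in pub_years)
    # Fill in zeros so every offset in the window is present.
    return {offset: counts.get(offset, 0) for offset in range(-window, 1)}


# Made-up publication years for a single target-indication pair.
example = counts_relative_to_reference([1990, 1995, 1995, 1998], reference_year=1998)
```

With counts aligned this way, the approved and failed sets can be compared offset by offset using whatever statistical test suits the data.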

Taking the study further, they applied machine learning classifiers and found that the extracted data features could be used to predict success or failure of target-indication pairs, and hence, approved or failed drugs. They conclude:


Drug safety is, of course, a prime focus of anyone in the pharmaceutical and biotech industries. The goal of any drug R&D project is to bring to market a safe and efficacious drug – oh yes, and ideally, create the latest blockbuster!

For monitoring drug safety, there are many tools and solutions. We were pleased to see the inclusion of Linguamatics I2E in a recent paper from the FDA on the “Use of data mining at the Food and Drug Administration” (Duggirala et al., 2016, J Am Med Inform Assoc).

This FDA review covers a very broad range of text and data mining approaches, across both FDA databases (e.g. MAUDE, VAERS) and external data such as MEDLINE, clinical study data, and social media.

Specifically, the FDA describe the use of I2E “to study clinical safety based on chemical structure information contained in medical literature. Linguamatics I2E enables custom searches using natural language processing to interpret unstructured text. The ability to predict the clinical safety of a drug based on chemical structures is becoming increasingly important, especially when adequate safety data are absent or equivocal.”


I attended a Big Data in Pharma conference recently, and very much liked a quote from Sir Muir Gray, cited by one of the speakers: "In the nineteenth century health was transformed by clean, clear water. In the twenty-first century, health will be transformed by clean clear knowledge."  

This was part of a series of discussions and round tables on how we, within the Pharma industry, can best use big data, both current and legacy, to inform decisions for the discovery, development and delivery of new healthcare therapeutics. Data integration, breaking down the data silos to create data assets, data interoperability, and the use of ontologies and NLP were all themes presented, with the aim of giving researchers and scientists a clean, clear view of all the appropriate knowledge for actionable decisions across the drug development pipeline.

A new publication describes how text analytics can provide one of the tools for that data interoperability ecosystem, to create a clear, clean view. McEntire et al. describe a system that combines Pipeline Pilot workflow tools, Linguamatics I2E NLP linguistics and semantics, and visualization dashboards to integrate information from key public domain sources, such as MEDLINE, OMIM, ClinicalTrials.gov, NIH grants, patents and news feeds, as well as internal content sources.


There's been a lot of excitement around recent studies on immuno-oncology in which the body’s immune defences are corralled to fight cancer. Experts consider it the most exciting advance since the development of chemotherapy half a century ago.

Many of our customers are involved in anti-cancer approaches based on modulation of the immunosuppressive properties of immune cells, and are using I2E to help generate insight around immuno-oncology and the tumor microenvironment (TME). Cancers can be viewed as complex ‘rogue’ organs, with malignant cells surrounded by blood vessels and a variety of other cells, including immune cells, fibroblasts, lymphocytes, and more. The tumor cells and the surrounding non-transformed cells interact constantly, and developing a better understanding of these TME interactions is a valuable approach in immuno-oncology drug development.

Knowledge in this field is growing very rapidly, which makes it very difficult for scientists to capture it manually, because of both the volume of publications and the variety and complexity of the information.

Challenges include ensuring a thorough search that captures relationships between genes or proteins and their effects on, or correlations with, a variety of cellular actors. These cellular actors include many of the immune system cells currently under investigation for immunotherapeutic approaches to oncology.

I2E provides the capability to find and extract these interactions from textual data, including capture of negation where needed. I2E allows efficient and effective searches over millions of text documents, and can harmonize the output to enable computational post-processing and visualization of these complex data.
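As a rough illustration of what capturing negation involves, the Python sketch below flags a relationship verb as negated when a negation cue appears shortly before it. This is a simple cue-window heuristic chosen for illustration, not a description of how I2E handles negation; the cue list, the window size and the function name `is_negated` are all assumptions.

```python
import re

# Hypothetical, deliberately short list of negation cues.
NEGATION_CUES = ("no", "not", "without", "failed to", "did not", "does not")


def is_negated(sentence, trigger, window=5):
    """Return True if a negation cue occurs within `window` tokens before `trigger`."""
    tokens = re.findall(r"[\w-]+", sentence.lower())
    trigger = trigger.lower()
    if trigger not in tokens:
        return False
    index = tokens.index(trigger)
    preceding = " ".join(tokens[max(0, index - window):index])
    return any(
        re.search(r"\b" + re.escape(cue) + r"\b", preceding)
        for cue in NEGATION_CUES
    )


negated = is_negated("PD-L1 did not inhibit T cell activation", "inhibit")
affirmed = is_negated("PD-L1 inhibits T cell activation", "inhibits")
```

A heuristic this simple misses long-range and implicit negation, which is one reason linguistic processing is preferred over keyword windows for this kind of extraction.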