Drug safety and pharmacovigilance are critical aspects of drug development. To understand and monitor the potential risks of pharmaceuticals, researchers use many different strategies to uncover evidence from real-world adverse event reports and patient-reported outcomes.

At the upcoming Linguamatics Text Mining Summit, there are three talks on text mining strategies that improve our understanding of drug-related adverse reactions.

Nina Mian from AstraZeneca will present research on text mining adverse event data from both FDA drug labels (derived from clinical trial data) and real-world data from PatientsLikeMe. Eric Lewis from GSK will discuss applications of I2E for clinical safety and pharmacovigilance – particularly the problems of identifying potential “new signals” and distinguishing signal from noise. Stuart Murray from Agios will present workflows for automated identification of potential drug safety events.

These talks, from industry specialists, demonstrate the value of text mining to access and understand the complex world of drug safety and safety signals.


Life sciences researchers who employ text mining and NLP (natural language processing) techniques to extract discrete facts and insights from scientific articles have generally relied on MEDLINE abstracts to define a corpus.

But there is increasing interest in mining full-text articles, with researchers experimenting on corpora sourced from Open Access repositories like PubMed Central (PMC). Organizations are eager to take advantage of the unique benefits full text provides, and rightly so.

Full-text content provides insights that researchers otherwise wouldn’t have had access to using abstracts alone. Here are three central benefits of mining a full-text corpus:

Volume. Full-text articles include more named entities and relationships between those entities than their corresponding abstracts – this is intuitively obvious when we consider the length of an abstract versus its full-text article. A study published in the Journal of Biomedical Informatics makes this point quantitatively: Only 7.84% of the scientific claims made in full-text articles are found in their abstracts.[i]


Wilmington, DE – The Pistoia Alliance, an organization dedicated to improving global life sciences R&D, has seen its membership increase with a number of new members including both large multinationals and start-ups.

The new members include Accenture, Linguamatics, Novaseek, Repositive, Agrimetrics and Daniel Taylor. Existing members upgrading to the new Startup membership category this quarter include KNIME, Scitegrity, Databiology, The Hyve, Binocular Vision, BioVariance, and Promeditec. This takes the membership of the Pistoia Alliance to over 80 globally, including many of the world’s biggest pharmaceutical companies, many of the most innovative start-ups, and companies and organizations that support the life sciences sector.

Dr. Steve Arlington, Pistoia Alliance President, said: “I am delighted to see that the Pistoia Alliance continues to show strong growth across all segments in life sciences, and we continue to attract a broad range of members. At the same time, we are also changing how we operate. Our challenge is to promote and encourage pre-competitive collaboration between our members, to benefit our members and ultimately accelerate the delivery of new drugs, devices and services to enhance performance within the sector. The Pistoia Alliance is well placed to help life sciences tackle many of its challenges and through our new strategy and the continued support of our members we will continue to support the global life sciences industry.”


Until recently, the use of natural language processing (NLP) in healthcare has been primarily limited to research efforts and population health within academic medical centers. However, with the proliferation of unstructured data from electronic medical records, providers are now seeking to harness the potential of their data and considering a variety of use cases for NLP technology.[1] That’s the conclusion of a recent KLAS report entitled “Natural Language Processing: Glimpses into the Future of Unstructured Data Mining.”

The report includes insights from 58 provider organizations and examines the various ways providers are currently leveraging NLP technology, as well as some of the use cases poised for wider adoption. Coding and documentation applications represent the broadest use of NLP engines. But it is clear providers have a growing interest in NLP solutions that advance their population health initiatives. An increasingly popular use case involves applications that use NLP to mine unstructured data within patient populations and include predictive analytics to identify at-risk patient populations.

A few of the major findings from KLAS’s report are summarized below.

How is NLP being used today?


We are always enthused to read about new ways to utilize text mining in the drug discovery and development process, and very much enjoyed the recent paper by Heinemann et al., “Reflection of successful anticancer drug development processes in the literature”. In this study, the researchers develop tools that allow the prediction of the approval or failure of a targeted cancer drug, using models based on information mined from MEDLINE abstracts, along with a slew of other quantitative metadata (e.g. MeSH headings, author counts, fraction of authors with industry affiliation, and more). 

I2E, Linguamatics’ text mining platform, enabled the researchers to systematically identify all MEDLINE abstracts containing both the protein target and the specific disease indication of a known set of successfully approved or failed cancer therapeutics; for example, abstracts containing both Her2 and breast cancer, or c-Kit and gastrointestinal stromal tumor (GIST). I2E enables the use of large vocabularies or ontologies of genes and diseases to extract key information, and the researchers used I2E for the rapid retrieval of publications containing any one of the many synonyms of a protein target or indication.
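
To make the idea of synonym-based co-occurrence retrieval concrete, here is a minimal Python sketch. It is not I2E's API or the method used in the paper; the synonym lists, function names and sample abstracts are all hypothetical and chosen only for illustration. A real ontology would contain far more terms and would need to handle ambiguity that simple pattern matching cannot.

```python
import re

# Hypothetical, deliberately tiny synonym dictionaries (illustration only).
TARGET_SYNONYMS = {
    "HER2": ["HER2", "ERBB2", "HER-2/neu"],
    "KIT": ["c-Kit", "KIT", "CD117"],
}
INDICATION_SYNONYMS = {
    "breast cancer": ["breast cancer", "breast carcinoma"],
    "GIST": ["gastrointestinal stromal tumor", "GIST"],
}


def mentions_any(text, synonyms):
    """Return True if any synonym occurs in text as a whole term (case-insensitive)."""
    for term in synonyms:
        # Lookarounds stop a term from matching inside a longer word.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            return True
    return False


def cooccurring_abstracts(abstracts, target, indication):
    """Return the abstracts that mention both the target and the indication."""
    return [
        a for a in abstracts
        if mentions_any(a, TARGET_SYNONYMS[target])
        and mentions_any(a, INDICATION_SYNONYMS[indication])
    ]


# Made-up example sentences standing in for MEDLINE abstracts.
abstracts = [
    "ERBB2 amplification predicts response in breast carcinoma.",
    "CD117 expression in gastrointestinal stromal tumor.",
    "HER2 signalling in gastric cancer cell lines.",
]
hits = cooccurring_abstracts(abstracts, "HER2", "breast cancer")
```

In this toy example only the first abstract is returned for the HER2 and breast cancer pair, even though it uses neither of those exact words. Counting such hits per publication year would give the kind of publication-count feature discussed in the study.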

The researchers found that the set of approved target-indication pairs showed a significantly higher publication count than the eventually failing pairs, starting from 9 years before FDA approval.

Taking the study further, they applied machine learning classifiers and found that the extracted data features could be used to predict success or failure of target-indication pairs, and hence, approved or failed drugs. They conclude: