Drug safety is, of course, a prime focus of anyone in the pharmaceutical and biotech industries. The goal of any drug R&D project is to bring to market a safe and efficacious drug – oh yes, and ideally, create the latest blockbuster!

For monitoring drug safety, there are many tools and solutions. We were pleased to see the inclusion of Linguamatics I2E in a recent paper from the FDA on the “Use of data mining at the Food and Drug Administration” (Duggirala et al., 2016, J Am Med Inform Assoc).

This FDA review covers a very broad range of text and data mining approaches, across both FDA databases (e.g. MAUDE, VAERS) and external data such as Medline, clinical study data, and social media.

Specifically, the FDA describe the use of I2E “to study clinical safety based on chemical structure information contained in medical literature. Linguamatics I2E enables custom searches using natural language processing to interpret unstructured text. The ability to predict the clinical safety of a drug based on chemical structures is becoming increasingly important, especially when adequate safety data are absent or equivocal.”


I attended a Big Data in Pharma conference recently, and very much liked a quote from Sir Muir Gray, cited by one of the speakers: "In the nineteenth century health was transformed by clear, clean water. In the twenty-first century, health will be transformed by clean clear knowledge."  

This was part of a series of discussions and round tables on how we, within the Pharma industry, can best use big data, both current and legacy, to inform decisions for the discovery, development and delivery of new healthcare therapeutics. Data integration, breaking down data silos to create data assets, data interoperability, and the use of ontologies and NLP were all themes presented, with the aim of giving researchers and scientists a clean, clear view of all the appropriate knowledge for actionable decisions across the drug development pipeline.

A new publication describes how text analytics can provide one of the tools for that data interoperability ecosystem, to create a clear, clean view. McEntire et al. describe a system that combines Pipeline Pilot workflow tools, Linguamatics I2E NLP linguistics and semantics, and visualization dashboards to integrate information from key public domain sources, such as MEDLINE, OMIM, ClinicalTrials.gov, NIH grants, patents, and news feeds, as well as internal content sources.


There's been a lot of excitement around recent studies on immuno-oncology in which the body’s immune defences are corralled to fight cancer. Experts consider it the most exciting advance since the development of chemotherapy half a century ago.

Many of our customers are involved in anti-cancer approaches based on modulation of the immunosuppressive properties of immune cells, and are using I2E to help generate insight around immuno-oncology and the tumor microenvironment (TME). Cancers can be viewed as complex ‘rogue’ organs, with malignant cells surrounded by blood vessels and a variety of other cells, including immune cells, fibroblasts, lymphocytes, and more. The tumor cells and the surrounding non-transformed cells interact constantly, and developing a better understanding of these TME interactions is a valuable approach in immuno-oncology drug development.

Knowledge in this field is growing very rapidly, which makes it very difficult for scientists to capture it manually, because of both the volume of publications and the variety and complexity of the information.

Challenges include ensuring a thorough search that captures relationships between genes/proteins and their effect on, or correlation with, a variety of cellular actors. These cellular actors include many of the immune system cells currently under investigation for immunotherapeutic approaches to oncology.

I2E provides the capability to find and extract these interactions from textual data, including capture of negation where needed. I2E allows efficient and effective searches over millions of text documents, and can harmonize the output to enable computational post-processing and visualization of these complex data.
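To make the idea of relationship extraction with negation capture more concrete, here is a minimal Python sketch. It is not how I2E works internally and does not use any I2E API; the term lists, the negation cues, and the `extract_relations` function are all hypothetical, chosen only for illustration. It simply flags sentences in which a gene/protein and a cell type co-occur, and records whether a negation cue appears in the same sentence.

```python
import re

# Hypothetical mini-vocabularies for illustration only; a real system
# would rely on curated ontologies rather than hand-written sets.
GENES = {"PD-L1", "CTLA-4", "TGF-beta"}
CELLS = {"T cells", "macrophages", "fibroblasts"}

# A few simple negation cues (an assumption, far from exhaustive).
NEGATION_CUES = re.compile(r"\b(?:not|no|never|failed to)\b", re.IGNORECASE)


def mentions(term, sentence):
    """True if the term appears as a whole word or phrase, ignoring case."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, sentence, re.IGNORECASE) is not None


def extract_relations(text):
    """Return (gene, cell, negated) triples for sentences mentioning both."""
    relations = []
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        genes = sorted(g for g in GENES if mentions(g, sentence))
        cells = sorted(c for c in CELLS if mentions(c, sentence))
        negated = bool(NEGATION_CUES.search(sentence))
        for gene in genes:
            for cell in cells:
                relations.append((gene, cell, negated))
    return relations


sample = (
    "PD-L1 suppresses T cells in the tumor microenvironment. "
    "CTLA-4 did not affect macrophages."
)
for relation in extract_relations(sample):
    print(relation)
```

Sentence-level co-occurrence with a cue list is deliberately crude: it cannot tell which entity a negation applies to, and it misses synonyms and spelling variants, which is exactly where linguistic processing and ontologies earn their keep.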


Tom Schmidt, Managing Editor, IDG Strategic Marketing Services, interviewed Dr. Jane Reed, Head of Life Science Strategy, Linguamatics, on how pharma and biotech companies use text analytics to reduce the time and cost of their clinical trials and get drugs to market faster.

The common statistic is that over 80% of data lies in unstructured text. Often, because of the way people write things, whether in patents, healthcare records, or scientific literature, it's not easy to pull out the nuggets that are going to help with those decisions, whether around the real-world value of your product, regulatory compliance, or many other areas. Text analytics has to play a part in addressing many problems because of the volume of data that is unstructured.



With the ongoing focus on healthcare outcomes-based payment models, pharmaceutical companies face powerful pressures to demonstrate not just the safety and efficacy of a new treatment, but also both cost effectiveness and comparative effectiveness. This means they must show that their agent is not only better than placebo but also better than other agents. Comparative effectiveness of any particular treatment can be established by interventional clinical trials, observational real-world evidence studies, or systematic review and meta-analysis. Access to ongoing and past clinical trials via trial registries provides much valuable information, but effective search can be hindered by issues such as inconsistent search vocabularies and the difficulty of searching unstructured text.
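As a small illustration of the vocabulary problem, the Python sketch below maps different names for the same drug to one canonical term before searching free-text trial titles. Everything in it is an assumption made for the example: the tiny `SYNONYMS` table, the `canonical_terms` and `find_trials` helpers, and the `EXAMPLE-…` records, which are invented placeholders and not real registry entries.

```python
import re

# Hypothetical synonym table: each name variant maps to one canonical term.
SYNONYMS = {
    "paracetamol": "acetaminophen",
    "acetaminophen": "acetaminophen",
    "acetylsalicylic acid": "aspirin",
    "asa": "aspirin",
    "aspirin": "aspirin",
}


def canonical_terms(text):
    """Return the set of canonical drug names mentioned in free text."""
    found = set()
    for variant, canonical in SYNONYMS.items():
        # Word boundaries stop "asa" from matching inside "nasal".
        if re.search(r"\b" + re.escape(variant) + r"\b", text, re.IGNORECASE):
            found.add(canonical)
    return found


def find_trials(records, drug):
    """Return IDs of records that mention the drug under any known name."""
    target = SYNONYMS.get(drug.lower(), drug.lower())
    return [r["id"] for r in records if target in canonical_terms(r["text"])]


# Invented placeholder records, for illustration only.
RECORDS = [
    {"id": "EXAMPLE-1", "text": "Paracetamol versus placebo for post-operative pain"},
    {"id": "EXAMPLE-2", "text": "Low-dose ASA in primary prevention"},
    {"id": "EXAMPLE-3", "text": "Nasal spray tolerability study"},
]

print(find_trials(RECORDS, "acetaminophen"))
print(find_trials(RECORDS, "aspirin"))
```

A plain keyword search for "acetaminophen" would miss the first record entirely, which is the kind of gap that vocabulary normalization is meant to close.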

Merck recently published a paper demonstrating the success of a text-mining pipeline that overcomes these issues and extracts key information for comparative effectiveness research from clinical trial registries. Researchers in the Informatics IT group wanted to search clinical trial registries (NIH ClinicalTrials.gov, WHO International Clinical Trials Registry Platform (ICTRP), and Citeline Trialtrove) and synthesize comparative effectiveness data for a set of Merck drugs, in order to: