Drug safety is, of course, one of the central concerns of any drug development project. Right from the start, project teams want to know whether the target they are interested in has any links to adverse events. Or, when they get to a lead series or lead compound, is there any evidence that similar compounds or compound classes have been shown to have side effects? If unexpected adverse events occur in clinical trials, project teams again turn to the literature and other sources to see if they can unearth a reason, a mechanism, or other evidence for the effect. And of course, post-market, pharmaceutical companies must regularly screen the worldwide scientific literature for potential adverse drug reactions, at least every two weeks.

Finding useful information in public sources can be daunting. There are so many different names for any particular gene target, compound, disease process, adverse event, or side effect that a comprehensive search means using strings and strings of keywords. And what is really needed is evidence that a compound is causing an effect, not treating the disease, so keyword search alone doesn’t work well. On top of that, there are many different data sources to search.
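To make the synonym problem concrete, here is a minimal Python sketch of mapping the many names in free text back to one canonical concept. The synonym lists are small, hand-written illustrations and not a curated ontology, and the names SYNONYMS, build_pattern and find_concepts are hypothetical helpers introduced only for this example.

```python
import re

# Hypothetical, hand-written synonym lists for illustration only;
# a real project would load a curated ontology instead.
SYNONYMS = {
    "hepatotoxicity": ["hepatotoxicity", "liver toxicity", "liver injury", "hepatic injury"],
    "acetaminophen": ["acetaminophen", "paracetamol", "APAP"],
}


def build_pattern(terms):
    """Compile one case-insensitive, whole-word pattern covering all terms."""
    # Longest terms first, so multi-word names win over shorter overlaps.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


PATTERNS = {concept: build_pattern(terms) for concept, terms in SYNONYMS.items()}


def find_concepts(text):
    """Return the set of canonical concepts mentioned anywhere in the text."""
    return {concept for concept, pattern in PATTERNS.items() if pattern.search(text)}
```

A search for the canonical concept then finds documents that use any of its names. This sketch only detects mentions; it says nothing about whether the compound causes the effect or treats it, which is the harder problem described above.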


A key requirement in drug development – and increasingly in precision/personalized medicine and pharmacogenomics – is a comprehensive understanding of the genetic associations for the disease of interest. For a multiple sclerosis (MS) biomarker discovery project, Sanofi wanted to annotate the association of human leukocyte antigen (HLA) alleles and haplotypes with diseases and drug hypersensitivity, as the HLA genotype is responsible for some 30% of the risk of MS and participates in almost every aspect of the disease.

HLA alleles have been associated with multiple autoimmune diseases, various types of cancer, infectious disease, and drug adverse events, but there are no known resources that systematically annotate these associations.

Developing a Comprehensive Catalog of Disease Annotations using Natural Language Processing (NLP)-based Text Analytics

Sanofi identified more than 400 HLA alleles through a whole exome sequencing-based HLA typing and analysis workflow. These potential candidate biomarkers were not annotated in any database. Sanofi then used the Linguamatics I2E NLP solution to analyse and search the literature to annotate the association of the identified HLA alleles with diseases and drug hypersensitivity.

Sanofi linguistically processed and indexed a literature corpus of 25 million PubMed abstracts and 4 million full-text journal articles with I2E text analytics, using an internally developed HLA gene ontology alongside Linguamatics I2E’s dictionary of relationship verbs (e.g. causes, leads to, results in) and its Diseases ontology. This processing identified HLA alleles and haplotypes and their relationships with diseases and drug hypersensitivity.
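The general idea of combining an allele vocabulary, a list of relationship verbs and a disease vocabulary can be sketched in a few lines of Python. This is not how I2E works internally and not Sanofi's workflow; it is a simplified regex illustration under stated assumptions. The allele pattern, the two-entry disease list, the added phrase "is associated with" and the function extract_relations are all assumptions made for this example.

```python
import re

# Assumed pattern for allele names such as HLA-B*57:01 or HLA-DRB1*15:01.
ALLELE = re.compile(r"HLA-[A-Z]{1,3}\d?\*\d{2,3}(?::\d{2,3}){0,3}")

# The first three verbs come from the text above; "is associated with"
# is an extra assumption added for this sketch.
RELATION = re.compile(r"\b(causes|leads to|results in|is associated with)\b", re.IGNORECASE)

# Tiny hypothetical stand-in for a disease ontology.
DISEASES = ["multiple sclerosis", "abacavir hypersensitivity"]
DISEASE = re.compile("|".join(re.escape(d) for d in DISEASES), re.IGNORECASE)


def extract_relations(text):
    """Return (allele, verb, disease) triples found in that order within a sentence."""
    triples = []
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        allele = ALLELE.search(sentence)
        if not allele:
            continue
        # Look for a relationship verb after the allele mention.
        verb = RELATION.search(sentence, allele.end())
        if not verb:
            continue
        # Look for a disease term after the verb.
        disease = DISEASE.search(sentence, verb.end())
        if disease:
            triples.append((allele.group(), verb.group(1).lower(), disease.group().lower()))
    return triples
```

Requiring the verb between the allele and the disease is what separates a stated relationship from a mere co-occurrence. A production system would also need to handle negation, passive voice and mentions spread across sentences, which this sketch does not attempt.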


There surely can’t be anyone in the pharma industry who hasn’t heard the story of thalidomide. The disaster that followed the release of thalidomide onto the market in the late 1950s triggered a wave of regulatory changes to ensure reliable evidence of drug safety, efficacy and chemical purity before a new drug is released onto the market.

While failure of clinical efficacy is the major cause of drug attrition, a poor safety profile is also a major factor in failure of drugs in development, at all stages from initial lead candidate through preclinical and clinical development to post-marketing surveillance. In order to ensure the safety of drugs on the market, rigorous testing is carried out throughout the pipeline, and can be categorised into preclinical safety/toxicology in animal models, clinical safety in human subjects, and then post-market pharmacovigilance, to look for safety signals across a wide patient population (see schematic below).

At every stage, critical data is being both generated and sought from unstructured text: internal safety reports, scientific literature, individual case safety reports, clinical investigator brochures, patient forums, social media, and conference abstracts. Intelligent search across these hundreds of thousands of pages can provide the information needed for key decision support. Many of our customers are using the power of Linguamatics I2E’s Natural Language Processing (NLP) solution to transform the unstructured text into actionable structured data that can be rapidly visualized and analyzed, at every stage of the safety lifecycle of a drug.


As medicinal chemists strive to fill the pipeline with the best possible novel compounds, they require efficient access to the ever-expanding mass of existing information and knowledge about compounds, targets, and diseases and how they are related. Much of this information is buried in published journal articles, patents, reports, and internal document repositories. Posing chemical compound-, target-, and disease-centered questions to extract and organize the data in order to explore these relationships is laborious, time-consuming, and potentially error-prone. Locating chemical structural information is especially challenging because chemicals in the literature are described by many different names: technical, trivial, proprietary, nonproprietary, generic, or trade names.

Roche pRED decided to address this problem and equip their medicinal chemists with a chemically-aware text mining tool (Artemis) that would remove the need for manual searches and data-wrangling, and present the data in a user- and analytics-friendly environment for further exploration. Daniel Stoffler and Raul Rodriguez-Esteban of Roche presented this work in their talk "ARTEMIS - A Text Mining Tool for Chemists" at the Linguamatics Spring Text Mining Conference in 2017.


Innovative Artificial Intelligence and Machine Learning Technologies Can Improve Pharma R&D, Reduce Costs and Benefit Patients

The pharma industry is constantly searching for more effective, more efficient tools and technologies to improve the drug discovery process. The statistics are well-known, and make gloomy reading: it takes 10-15 years to develop a new drug, at a cost of up to $1 billion. There is currently vigorous discussion over whether new tools and technologies can significantly impact these metrics. Big data, blockchain, artificial intelligence (AI) and machine learning (ML) are much talked about as holding the key to digital transformation of drug discovery.

At the Bio-IT World Conference & Expo last month, many of these themes were explored. Across the dozen or so session tracks, there were talks and workshops to share information and best practices on how scientists in biotech, pharma, academic institutes and vendor companies are applying AI and ML to a variety of use cases, such as models for adaptive clinical trials, imaging analytics (e.g. for pathology or clinical sample data), lead design, QSAR, analysing data streams from mobile monitoring devices, and more. Some snippets from the talks include: