Four million people die from diabetes annually. Novo Nordisk, a global healthcare company, has a mission to change that. Although it has a presence in 170 countries, is already helping 28 million patients, and supplies half of the world’s insulin, the company still faces an enormous challenge: novel drug approaches are needed, and drug development is a long, expensive process. The GLIA (Global Information & Analysis) team at Novo Nordisk aims to help by providing the best information possible to researchers and product teams.

Using natural language processing (NLP) to extract information from real-world data sources

The answers Novo Nordisk needs are buried in a myriad of unstructured real-world data sources: research papers, news reports, market information, patient use information, and more.

“Finding accurate information in an ever-growing ocean of information is becoming more important than ever,” explains Novo Nordisk senior information scientist Solmaz Gabery Adams.

Extensive research informs every step on the long path to delivering healthcare, from identifying needs and undertaking drug discovery to clinical trials and regulatory review before bringing new treatments to market. At every stage, Novo Nordisk researchers and managers must make crucial decisions, including which projects to advance and which projects to leave behind.


Drug safety is, of course, one of the central concerns of any drug development project. Right from the start, project teams want to know whether the target they are interested in has any links to adverse events. When they reach a lead series or lead compound, they want to know whether similar compounds or compound classes have been shown to have side effects. If unexpected adverse events occur in clinical trials, project teams again turn to the literature and other sources to unearth a reason, a mechanism, or other evidence for the effect. And post-market, pharmaceutical companies must regularly screen the worldwide scientific literature for potential adverse drug reactions, at least every two weeks.

Finding useful information in public sources can be daunting. There are many different names for any given gene target, compound, disease process, adverse event, or side effect, so comprehensive searching means stringing together long lists of keywords. What is really needed, moreover, is evidence that a compound is causing an effect rather than treating the disease, a distinction keyword search handles poorly, as the sketch below illustrates. And then there are the many different data sources to search.
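To make that distinction concrete, here is a minimal sketch in plain Python (not Linguamatics I2E) of why keyword co-occurrence falls short and how a dictionary of synonyms combined with relationship verbs can pick out a causal link. All compound names, event terms, and sentences are invented for illustration.

```python
# Minimal sketch (not I2E): keyword co-occurrence vs. a simple causal pattern.
# All synonym lists and sentences below are invented, illustrative data.
import re

COMPOUND_SYNONYMS = ["drug x", "dx-123", "examplumab"]             # hypothetical names
EVENT_SYNONYMS = ["hepatotoxicity", "liver injury", "raised alt"]  # hypothetical terms
CAUSAL_VERBS = ["causes", "caused", "leads to", "results in", "induced"]

def cooccurrence_hit(sentence: str) -> bool:
    """Naive keyword search: both terms present, relationship unknown."""
    s = sentence.lower()
    return any(c in s for c in COMPOUND_SYNONYMS) and any(e in s for e in EVENT_SYNONYMS)

def causal_hit(sentence: str) -> bool:
    """Simple relationship pattern: compound ... causal verb ... adverse event."""
    s = sentence.lower()
    pattern = rf"({'|'.join(map(re.escape, COMPOUND_SYNONYMS))}).*" \
              rf"({'|'.join(map(re.escape, CAUSAL_VERBS))}).*" \
              rf"({'|'.join(map(re.escape, EVENT_SYNONYMS))})"
    return re.search(pattern, s) is not None

sentences = [
    "DX-123 treatment reduced liver injury in the model.",   # co-occurrence, but not causal
    "Examplumab caused raised ALT in two subjects.",          # a genuine safety signal
]
for s in sentences:
    print(cooccurrence_hit(s), causal_hit(s), "-", s)
# Both sentences match the keyword search; only the second matches the causal pattern.
```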


A key requirement in drug development – and increasingly in precision/personalized medicine and pharmacogenomics – is a comprehensive understanding of the genetic associations for the disease of interest. For a multiple sclerosis (MS) biomarker discovery project, Sanofi wanted to annotate the association of human leukocyte antigen (HLA) alleles and haplotypes with diseases and drug hypersensitivity, as the HLA genotype is responsible for some 30% of the risk of MS and participates in almost every aspect of the disease.

HLA alleles have been associated with multiple autoimmune diseases, various types of cancer, infectious disease, and drug adverse events, but there are no known resources that systematically annotate these associations.

Developing a Comprehensive Catalog of Disease Annotations Using Natural Language Processing (NLP)-based Text Analytics

Sanofi identified more than 400 HLA alleles through a whole exome sequencing-based HLA typing and analysis workflow. These potential candidate biomarkers were not annotated in any database. Sanofi then used the Linguamatics I2E NLP solution to analyse and search the literature to annotate the association of the identified HLA alleles with diseases and drug hypersensitivity.

Sanofi linguistically processed and indexed a literature corpus of 25 million PubMed abstracts and 4 million full-text journal articles with I2E text analytics, using an internally developed HLA gene ontology, alongside Linguamatics I2E’s dictionary of relationship verbs (e.g. causes, leads to, results in) and Diseases ontology. This identified HLA alleles and haplotypes and their relationships with diseases and drug sensitivity.
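As a rough illustration of this kind of dictionary- and verb-driven extraction, the sketch below matches allele names, relationship verbs, and disease terms in example abstracts and emits one structured annotation row per hit. It is plain Python rather than the I2E query language, and the term lists and abstracts are tiny illustrative stand-ins for the ontologies and 25-million-abstract corpus described above.

```python
# Minimal sketch of dictionary- and verb-driven relation extraction (not the I2E
# query language). Term lists and abstracts are small illustrative stand-ins.
import re
from typing import Dict, Iterator

HLA_ALLELES = ["HLA-DRB1*15:01", "HLA-B*57:01"]                    # illustrative subset
RELATION_VERBS = ["is associated with", "causes", "leads to", "results in"]
DISEASES = ["multiple sclerosis", "abacavir hypersensitivity"]     # illustrative subset

PATTERN = re.compile(
    rf"(?P<allele>{'|'.join(map(re.escape, HLA_ALLELES))})\s+"
    rf"(?P<verb>{'|'.join(map(re.escape, RELATION_VERBS))})\s+"
    rf"(?P<disease>{'|'.join(map(re.escape, DISEASES))})",
    re.IGNORECASE,
)

def annotate(abstracts: Dict[str, str]) -> Iterator[Dict[str, str]]:
    """Yield one structured annotation row per allele-verb-disease match."""
    for pmid, text in abstracts.items():
        for m in PATTERN.finditer(text):
            yield {"pmid": pmid, "allele": m.group("allele"),
                   "relation": m.group("verb"), "disease": m.group("disease")}

abstracts = {  # invented example abstracts, keyed by placeholder PMIDs
    "00000001": "HLA-DRB1*15:01 is associated with multiple sclerosis in several cohorts.",
    "00000002": "Carriage of HLA-B*57:01 causes abacavir hypersensitivity in treated patients.",
}
for row in annotate(abstracts):
    print(row)
```

A real ontology would attach many synonyms and a canonical identifier to each allele and disease term, but the principle is the same: turn free text into rows of allele, relation, and disease that can be collated into a catalog.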


There surely can’t be anyone in the pharma industry who hasn’t heard the story of thalidomide. The disaster that followed thalidomide’s release onto the market in 1959 triggered a wave of regulatory changes to ensure reliable evidence of drug safety, efficacy, and chemical purity before a new drug reaches patients.

While failure of clinical efficacy is the major cause of drug attrition, a poor safety profile is also a major factor in the failure of drugs in development, at all stages from initial lead candidate through preclinical and clinical development to post-marketing surveillance. To ensure the safety of drugs on the market, rigorous testing is carried out throughout the pipeline; it can be categorised into preclinical safety/toxicology in animal models, clinical safety in human subjects, and post-market pharmacovigilance, which looks for safety signals across a wide patient population.

At every stage, critical data is both generated and sought in unstructured text – internal safety reports, scientific literature, individual case safety reports, clinical investigator brochures, patient forums, social media, and conference abstracts. Intelligent search across these hundreds of thousands of pages can provide the information for key decision support. Many of our customers use the power of Linguamatics I2E’s Natural Language Processing (NLP) solution to transform unstructured text into actionable structured data that can be rapidly visualized and analyzed at every stage of a drug’s safety lifecycle.
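As a simple illustration of what "structured data that can be visualized and analyzed" can mean in practice, the sketch below (plain Python, not I2E) rolls hypothetical extracted drug–event mentions from mixed source types into counts per drug–event pair. Every record is an invented placeholder.

```python
# Minimal sketch (not I2E): rolling extracted safety mentions from mixed sources
# into a structured summary that can be counted, visualized, and analyzed.
# Every record below is a hypothetical placeholder, not real data.
from collections import Counter

# Each record is what an upstream NLP extraction step might emit.
mentions = [
    {"source": "literature", "doc": "PMID:0000001", "drug": "drug_x", "event": "nausea"},
    {"source": "case_report", "doc": "ICSR:42", "drug": "drug_x", "event": "nausea"},
    {"source": "patient_forum", "doc": "post:9", "drug": "drug_x", "event": "headache"},
    {"source": "literature", "doc": "PMID:0000002", "drug": "drug_y", "event": "nausea"},
]

# Structured rollup: mention counts and contributing source types per drug-event pair.
pair_counts = Counter((m["drug"], m["event"]) for m in mentions)
sources_by_pair = {}
for m in mentions:
    sources_by_pair.setdefault((m["drug"], m["event"]), set()).add(m["source"])

for (drug, event), n in pair_counts.most_common():
    sources = ", ".join(sorted(sources_by_pair[(drug, event)]))
    print(f"{drug:7}  {event:9}  mentions={n}  sources={sources}")
```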


As medicinal chemists strive to fill the pipeline with the best possible novel compounds, they require efficient access to the ever-expanding mass of existing information and knowledge about compounds, targets, and diseases and how they are related. Much of this information is buried in published journal articles, patents, reports, and internal document repositories. Posing chemical compound-, target-, and disease-centered questions to extract and organize the data in order to explore these relationships is laborious, time-consuming, and potentially error-prone. Locating chemical structural information is especially challenging when chemicals in the literature are described by many different names: technical, trivial, proprietary, nonproprietary, generic, or trade names.
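One way to picture the synonym problem is as a normalization step that maps the many surface names of a chemical to a single canonical record. The sketch below is a deliberately simplified, hand-built illustration in plain Python; it is not how Artemis or I2E work, and a production system would rely on a trained chemical entity recognizer and structure-level identifiers rather than a static lookup table.

```python
# Minimal sketch of chemical name normalization: many surface names, one record.
# The synonym table is hand-built and purely illustrative; a chemically aware
# tool would resolve names to structures with a trained entity recognizer.
from typing import Set

# Canonical identifier -> known surface forms (generic, trade, and systematic names).
SYNONYMS = {
    "CHEM:0001": ["acetylsalicylic acid", "aspirin", "2-acetoxybenzoic acid"],
    "CHEM:0002": ["paracetamol", "acetaminophen", "APAP"],
}

# Invert to a lookup from lower-cased surface name to canonical identifier.
NAME_TO_ID = {name.lower(): chem_id
              for chem_id, names in SYNONYMS.items()
              for name in names}

def normalize(text: str) -> Set[str]:
    """Return the canonical ids of all known chemical names mentioned in the text."""
    lowered = text.lower()
    return {chem_id for name, chem_id in NAME_TO_ID.items() if name in lowered}

print(normalize("Patients received aspirin and acetaminophen after surgery."))
# -> {'CHEM:0001', 'CHEM:0002'}: two different surface names, two canonical records
```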

Roche pRED decided to address this problem and equip their medicinal chemists with a chemically aware text mining tool (Artemis) that would remove the need for manual searches and data wrangling, and present the data in a user- and analytics-friendly environment for further exploration. Daniel Stoffler and Raul Rodriguez-Esteban of Roche presented this work in their talk "ARTEMIS - A Text Mining Tool for Chemists" at the Linguamatics Spring Text Mining Conference in 2017.