Text Analytics and Natural Language Processing in Drug Discovery

Understanding gene-disease associations, pathways and systems is critical for drug discovery and basic research. Much of the data to support these decisions are buried in unstructured text, both in public databases and internal sources, and it’s a constant challenge to keep on top of this information. The Linguamatics NLP platform transforms this text into actionable data that can be quickly visualized and analyzed at every stage in drug discovery.

Early drug discovery

Pharmaceutical and biotechnology organizations face the increasing challenge of discovering new drugs. A deluge of information is available, but keeping on top of the latest literature and data using traditional search methods has become increasingly difficult.

Typically, a project team will define a disease area and the therapeutic need to be met. The team then looks to identify the biochemical, cellular or pathophysiological mechanism that will be targeted and, if possible, to identify and validate a molecular drug target (the key protein involved). Next comes the identification of a lead structure, followed by the design, testing and fine-tuning of the drug molecule. For biopharmaceuticals, the concept of protein target and lead compound has much less relevance, and the problems revolve mainly around the production, purification and formulation of the antibody or other therapeutic modality in a form suitable for delivery to patients.

Gene-disease mapping and target identification

The first step in the discovery of a new medicine is to identify the biological origin of the disease, and potential targets for intervention.

This requires a comprehensive understanding of the genes involved in the disease pathway, so a systematic review of the public domain literature is important. It is just as important to understand the intellectual property around any potential target area, in order to prioritize key targets for further development. Assessing unmet medical need and understanding the current market and potential gaps are also strategic issues that are important in early project decisions.

Keeping abreast of all the relevant literature needed to identify targets and understand the association of genes and diseases becomes an almost impossible task, and key information can be missed. Gaining an up-to-date view of ongoing clinical trials, patent filings and competitor activities can also be time-consuming.

Natural language processing-based text mining can help discovery project teams access and analyze key information more rapidly.

Case Study: Target Prioritization at Pfizer

This case study describes how the Linguamatics NLP platform has been used to capture valuable information from the life science literature, saving time and increasing productivity.


Case Study: Target selection at AstraZeneca

AstraZeneca’s aim was to integrate text mining with other discovery capabilities so that the value of literature information could be exploited more widely. Using the Linguamatics NLP platform, the team developed an agile and scalable enterprise text mining capability to improve productivity, develop an objective, holistic view of target options, and improve the quality of early discovery decision-making.


Case Study: Text Mining within a Biotech Setting at Syntaxin

The Linguamatics NLP platform has given Syntaxin the capability to span the required knowledge domains effectively and at scale with a focused informatics team, a key requirement in a resource-constrained biotech environment. Combined text mining queries via Linguamatics NLP allowed an informed disease selection process to be implemented. Searches routinely explore new areas of biology and opportunities for the company.


Biomarker discovery

Biomarkers are now an essential part of the drug discovery process. A recent study by AstraZeneca (Cook et al. (2014); Nat Rev Drug Discov) found that:

  • 82% of projects were active or successful in Phase IIa when they included efficacy biomarkers, compared to 30% of projects without biomarkers
  • Safety and PK/PD biomarkers are critical to successful projects
  • Clinical biomarkers should be an “integral part of the R&D programme,” and used to guide patient selection “as early as possible”

Biomarkers can be defined as naturally occurring molecules, genes, or characteristics by which a particular pathological or physiological process, disease, etc. can be identified. There are two major types of biomarkers: biomarkers of exposure, which are used in risk prediction and safety/toxicity assessment; and biomarkers of disease, which are used in screening and diagnosis and monitoring of disease progression.

These biomarkers can take different forms, e.g. enzymes with varying activity, changes in expression levels of particular genes, or the presence or absence of individual metabolites. The flexibility of Linguamatics NLP allows the user to search for any of these data types and to find relationships between known or novel markers, and diseases, mutations, drugs and more.

The Linguamatics NLP platform allows researchers to identify targets in disease areas of interest and establish a ranking based on factors such as safety and potential for therapeutic benefit. Related areas such as biomarker discovery and genotype-phenotype associations can also benefit.

Merck Use Case for Biomarker Discovery

Researchers at Merck used Linguamatics NLP and other tools to discover potential novel biomarkers and phenotypes for diabetes and obesity, from PubMed, clinical trial data, and internal Merck research documents. Trugenberger et al. (2013): “Discovery of novel biomarkers and phenotypes by semantic technologies”. BMC Bioinformatics; 14:51.

Roche use case: Creation of in-house database for candidate biomarkers

Martin Baron, Information Scientist, Roche Diagnostics, presented work done by his team to create a biomarker database. Roche used I2E to create a knowledgebase of disease-biomarker associations by mining Medline and full-text PDFs, which scientists could then query. Over 40 different searches were combined to create a flexible set of queries covering key aspects of gene-disease relationships (e.g. altered expression, regulatory modifications, genetic variations, mutations and negative associations) that are not often found in available structured gene-disease databases. The initial disease-marker associations (DiMA) knowledgebase contained over 350k associations between 12k genes and 3.5k diseases. In addition, I2E can scan the literature for specific disease-biomarker associations on a day-to-day basis, to keep the in-house knowledgebase current.
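
As an illustration only, the idea of combining the output of many searches into one queryable knowledgebase can be sketched in a few lines of Python. This is not Roche's or I2E's implementation: the query names, gene and disease labels, and document IDs below are all placeholders, and the function name build_knowledgebase is hypothetical.

```python
from collections import defaultdict

# Hypothetical output of three separate text mining queries.
# Each hit is (gene, disease, document_id); all values are placeholders,
# not real findings.
query_results = {
    "altered_expression": [
        ("GENE1", "disease A", "doc-001"),
        ("GENE2", "disease B", "doc-002"),
    ],
    "mutation": [
        ("GENE1", "disease A", "doc-003"),
        ("GENE1", "disease A", "doc-001"),
    ],
    "negative_association": [
        ("GENE3", "disease A", "doc-004"),
    ],
}


def build_knowledgebase(results_by_query):
    """Merge hits from many queries into one record per (gene, disease) pair."""
    kb = defaultdict(lambda: {"relation_types": set(), "documents": set()})
    for relation_type, hits in results_by_query.items():
        for gene, disease, doc_id in hits:
            entry = kb[(gene, disease)]
            entry["relation_types"].add(relation_type)  # which query found it
            entry["documents"].add(doc_id)              # supporting evidence
    return dict(kb)


kb = build_knowledgebase(query_results)
for (gene, disease), entry in sorted(kb.items()):
    print(gene, disease, sorted(entry["relation_types"]), len(entry["documents"]))
```

Keying each record on the gene-disease pair and storing the relation types and supporting documents as sets means that repeated hits from different queries are merged rather than duplicated, and that new daily results can be added by running the same merge again.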


By massively speeding up the rate at which genes can be sequenced, Next Generation Sequencing (NGS) promises to revolutionize applied markets such as diagnostics, drug discovery, biomarker discovery, agricultural and animal research, and personalized medicine. The final step in any NGS pipeline is biological interpretation, which tends to involve manual searching of databases and recent scientific literature.

The Linguamatics NLP platform can dramatically shorten the time needed for this analysis, providing a detailed and focused gene profile for the genotype, variants, mutations and phenotype under investigation.

Sanofi Use case for Multiple Sclerosis (MS) biomarker discovery

Sanofi established a workflow for NGS-based HLA typing (including whole exome, RNA-seq) and analysis that identified more than 400 HLA alleles. They used the Linguamatics I2E platform to analyze and search the literature to annotate the association of the HLA alleles with diseases and drug hypersensitivity.


Drug repurposing

Drug repurposing (also known as drug repositioning, drug re-profiling or therapeutic switching) is the application of known drugs and compounds to new indications.

With the large amount of pharmacological and biological knowledge available in the literature, it has become increasingly feasible to find novel indications for existing drugs using an in silico approach.

Identifying alternative potential disease areas for an approved drug enables the drug to be repurposed for a new market at a fraction of the cost of bringing a new drug to market. In addition, patents can be filed that extend the life cycle of a drug before it faces competition from generic alternatives, which can have a significant impact on sales and profitability.

The Linguamatics NLP platform can be used to assess potential associations between compounds, their target proteins, and novel disease-related pathways, by comprehensive analysis of scientific literature.

Tactics for using NLP for drug repurposing include exploiting the existing domain knowledge (around drugs, diseases, and mechanisms) to scan literature and other textual resources systematically and exhaustively, to identify and validate relationships between these entities.
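
One generic way to picture this kind of systematic scan is to link drug-target relations with target-disease relations and flag drug-disease pairs that are not already known indications. The Python sketch below is a minimal illustration of that linking pattern under stated assumptions, not a description of how the Linguamatics platform works: the relation lists stand in for facts extracted from text, every drug, target and disease name is a placeholder, and the function name repurposing_candidates is hypothetical.

```python
from collections import defaultdict

# Placeholder relations, standing in for facts extracted from literature.
# None of these names refer to real drugs, targets or diseases.
drug_targets = [
    ("drug_A", "target_X"),
    ("drug_A", "target_Y"),
    ("drug_B", "target_Y"),
]
target_diseases = [
    ("target_X", "disease_1"),
    ("target_X", "disease_3"),
    ("target_Y", "disease_2"),
]
known_indications = {("drug_A", "disease_1")}


def repurposing_candidates(drug_targets, target_diseases, known_indications):
    """Return {(drug, disease): linking targets} for pairs not already known."""
    diseases_by_target = defaultdict(set)
    for target, disease in target_diseases:
        diseases_by_target[target].add(disease)

    candidates = defaultdict(set)
    for drug, target in drug_targets:
        for disease in diseases_by_target.get(target, ()):
            if (drug, disease) not in known_indications:
                candidates[(drug, disease)].add(target)  # keep the evidence path
    return dict(candidates)


candidates = repurposing_candidates(drug_targets, target_diseases, known_indications)
for (drug, disease), targets in sorted(candidates.items()):
    print(f"{drug} -> {disease} via {sorted(targets)}")
```

Keeping the linking targets alongside each candidate pair preserves the chain of evidence, so that a scientist can go back to the supporting text and validate the hypothesis before it is taken further.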

With the rising costs of drug discovery and an increasing focus on rare diseases, methods and solutions for finding new uses for existing drugs continue to evolve.


Lead identification and lead optimization

The identification of high-quality hits and lead compounds is of paramount importance in the drug discovery process. An understanding of the relationship between structure, activity and the mechanistic aspects of action is key to selecting the best chemical class for further optimization. It is therefore crucial to obtain high-quality data on the affinity, kinetic, mechanistic and thermodynamic aspects of the interaction between potential drug candidates and their targets.

Case Study: Text mining at Roche pRED

Roche medicinal chemists were looking for a better way to query the ever-increasing flood of journal articles, patents and other diverse sources for compound/target/disease relationships. They wanted to investigate whether chemically aware text mining could also speed up and improve the decision-making process. Roche developed the Artemis system based on the Linguamatics NLP solution, augmented with ChemAxon’s chemical annotation and name-to-structure tools, to extract and organize compound/target/disease relationships.