Faster, better, cheaper... how often have we heard these words in the context of any process along the long path of drug development? There are myriad solutions that can help at different stages, enabling more comprehensive target assessment, more rapid lead optimization, and so on. One of the most expensive parts of the drug development process is clinical trials, with bottlenecks including access to knowledge for site selection, patient populations, principal investigators and key opinion leaders.

Researchers naturally look to use information from current and past trials, but manually extracting the relevant information can be resource-intensive, repetitive and, therefore, prone to errors. Time is money, so reducing costs and errors is critical.

One of our customers, Merck, use Linguamatics I2E for text analytics over public domain clinical trial data, to improve clinical trial site selection. 

One example of the benefits of text analytics is a site selection project for Merck's Experimental Medicine division (EMS). They needed to locate a clinical trial site that would be able to conduct gastric bypass trials with the ability to measure gut peptides before and after surgery. The ideal trial site needed to meet many different criteria (over a dozen), and finding such a site using the public domain search interface to ClinicalTrials.gov would be hugely time-consuming.


Reading some of the FDA blogs reviewing 2015, I was interested to read that "for the second consecutive year, [the FDA] approved more drugs to treat rare diseases than any previous year in our history." This is great news for the patients affected by these rare or orphan diseases, and there is of course potential for applications of such drugs and the knowledge around these diseases across the wider population and in broader healthcare.

Text analytics can play a part in developing better understanding around the biology of these rare diseases. There's a great example of this application of text mining from Madhusudan Natarajan at Shire Pharmaceuticals. Shire develops and provides healthcare in the areas of behavioural health, gastrointestinal conditions, rare diseases, and regenerative medicine, and Madhu has presented his research using text analytics to uncover disease severity and genotype-phenotype associations for Hunter Syndrome (also known as Mucopolysaccharidosis II).

We recently hosted a webinar with Madhu. In this webinar, he illustrates some of the challenges for R&D for orphan diseases, particularly around text mining for mutation and variant patterns, which can be reported in so many different ways in the literature. 
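To illustrate why variant mentions are hard to collect, here is a minimal Python sketch that maps two common ways of writing a protein substitution (one-letter and three-letter amino acid codes) to a single canonical form. It is not the approach used in the webinar or by I2E; the function name `normalize_variants` and the sample sentence in the usage note are hypothetical, and real literature contains many more notations (DNA-level changes, deletions, free-text descriptions) that this sketch ignores.

```python
import re

# Standard three-letter to one-letter amino acid abbreviations.
AA3_TO_1 = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

AA1 = "ACDEFGHIKLMNPQRSTVWY"
AA3 = "|".join(AA3_TO_1)

# One-letter form such as "R468W".
ONE_LETTER = re.compile(rf"\b([{AA1}])(\d+)([{AA1}])\b")
# Three-letter form such as "Arg468Trp", with an optional "p." prefix.
THREE_LETTER = re.compile(rf"\b(?:p\.)?({AA3})(\d+)({AA3})\b")


def normalize_variants(text):
    """Return the sorted, de-duplicated substitutions found in text,
    each written in one-letter form (for example "R468W")."""
    found = set()
    for ref, pos, alt in ONE_LETTER.findall(text):
        found.add(f"{ref}{pos}{alt}")
    for ref, pos, alt in THREE_LETTER.findall(text):
        found.add(f"{AA3_TO_1[ref]}{pos}{AA3_TO_1[alt]}")
    return sorted(found)
```

For example, calling `normalize_variants("R468W, also written p.Arg468Trp, and Arg468Gln")` collapses the first two spellings into one entry and keeps the third as a separate variant.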

Webinar: A systematic examination of gene-disease associations through text mining approaches


Animal models are crucial in the understanding of disease, the underlying pathways and the gene targets that play a role. One tool that has shown great value is the knockout mouse model.

The number of KO mouse models has increased massively since the first one in 1989, and mouse models have been used successfully in increasing our understanding of diseases as varied as different cancers, diabetes, obesity, blindness, Huntington's disease, aggressive behaviour, and even drug addiction.

Understanding the landscape of KO mouse models for any particular disease area is important, and curated databases (e.g. IMPC or MGI) provide valuable data, but keeping track of new KO mouse models published in the scientific literature is challenging.

Peng Zhang, Senior Staff Scientist at Regeneron Pharmaceuticals, uses Linguamatics I2E to tackle this challenge, and he presented on “Text Mining for Knockout Mice and Phenotypes” earlier this year.

Diagram showing the set of KO genes involved in autoimmune phenotypes. All hits from both I2E and MGI were manually curated, and only 479 unique KO genes were considered “true positive”. 61% of the true positives came only from the I2E query and were not covered by MGI.
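The comparison behind a diagram like this comes down to set arithmetic over gene lists. The following Python sketch shows one way to compute such a breakdown. It is an assumption about how the bookkeeping could be done, not the presenter's actual workflow, and the function name `source_breakdown` and its inputs are hypothetical.

```python
def source_breakdown(i2e_hits, mgi_hits, curated_true):
    """Split curated true positives by the source that found them.

    i2e_hits, mgi_hits: iterables of gene symbols returned by each source.
    curated_true: gene symbols confirmed as true positives by manual curation.
    """
    curated = set(curated_true)
    i2e_tp = set(i2e_hits) & curated   # true positives found by the I2E query
    mgi_tp = set(mgi_hits) & curated   # true positives found in MGI
    all_tp = i2e_tp | mgi_tp
    only_i2e = i2e_tp - mgi_tp
    return {
        "true_positives": len(all_tp),
        "i2e_only": len(only_i2e),
        "mgi_only": len(mgi_tp - i2e_tp),
        "both": len(i2e_tp & mgi_tp),
        # Share of all true positives that only the I2E query found.
        "i2e_only_fraction": len(only_i2e) / len(all_tp) if all_tp else 0.0,
    }
```

Applied to the figures in the caption, an `i2e_only_fraction` of 0.61 over 479 true positives corresponds to roughly 290 genes found only by the I2E query.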


Earlier this year, Linguamatics announced our new Connected Data Technology for federated search, and in our newest version, I2E 4.4, we build on this to take another step along the path of better data interoperability. I2E 4.4 introduces a more powerful way to customize your text analytics results using enhanced linkouts in the HTML output, enabling you, for example, to connect your text-mined data to structured content.

Linkouts enable you to link out to, or pull in, additional information relating to the preferred terms (PTs) or concept identifiers (NodeIDs) in your query results. They can be hyperlinks, images or customized output. For example, you can configure linkouts to see information from an external website by clicking on the concept in the text-mined query results. Alternatively, it is possible to enable the interface to display an image in the query results, such as a chemical structure, instead of the preferred term.

This new functionality means you can use linkouts to enhance query results with additional related information that provides more context or metadata for your search. So, for example, a search for chemicals from ChEBI could link directly from the preferred term in your results to the webpage for that concept on the EBI web site (e.g. Cyclosporine), whilst a gene name in the same result links to EntrezGene (e.g. ICAM1).
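The general idea of a linkout, turning a concept identifier into a hyperlink on the preferred term, can be sketched in a few lines of Python. This is not I2E's linkout configuration syntax, which is not shown here. The URL templates, the `build_linkout` function, and the identifiers in the usage note are all assumptions for illustration and should be checked against the target sites before use.

```python
import html
from urllib.parse import quote

# Assumed URL patterns for the two resources; verify before relying on them.
LINKOUT_TEMPLATES = {
    "chebi": "https://www.ebi.ac.uk/chebi/searchId.do?chebiId={id}",
    "entrezgene": "https://www.ncbi.nlm.nih.gov/gene/{id}",
}


def build_linkout(source, concept_id, preferred_term):
    """Return an HTML anchor linking the preferred term to its concept page.

    If no template is known for the source, return the escaped term unlinked.
    """
    template = LINKOUT_TEMPLATES.get(source.lower())
    if template is None:
        return html.escape(preferred_term)
    # Keep ":" unescaped so identifiers such as "CHEBI:4031" stay readable.
    url = template.format(id=quote(str(concept_id), safe=":"))
    return f'<a href="{html.escape(url)}">{html.escape(preferred_term)}</a>'
```

With illustrative identifiers, `build_linkout("ChEBI", "CHEBI:4031", "Cyclosporine")` and `build_linkout("EntrezGene", 3383, "ICAM1")` each return an anchor tag whose text is the preferred term, so a chemical and a gene in the same result row can point to different resources.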


At the October Text Mining Summit, we had speakers from pharma, biotech and academia presenting on an amazing range of different applications of text analytics to provide value within the drug discovery-development pipeline. Over a day and a half we heard from a dozen external speakers from healthcare and pharma, all sharing their enthusiasm for the value that text analytics can bring to the drug discovery, development and delivery environments.

Work presented by UNCC researchers using I2E to understand potential health effects of plant phytochemicals: network map of text-mined associations linking plants to phytochemicals, phytochemicals to human genes, human genes to biological pathways, and pathways to human health phenotypes.

The life science applications ranged from safety, target discovery and alerting, genotype-phenotype annotations, and clinical trial analytics to phytochemicals as potential nutraceuticals and patent landscaping for antibody-drug conjugates.

Back by popular demand, Wendy Cornell (ex-Merck) presented on gaining value from internal preclinical safety reports using I2E, which we’ve discussed in blog posts here before.