There seems to be a certain buzz around rare and orphan diseases. Following the Findacure meeting I attended last month, there are two recent events I’d like to mention.

Firstly, I attended the first Cambridge Rare Disease Network summit, held in Cambridge, UK, where a fantastic line-up of speakers from a range of professions discussed current and new initiatives in rare disease. The debates ranged from the use of next-generation sequencing for diagnostics, to crowd-sourcing both for science and funding, to drug repurposing, to the views of payers and the issues around pricing.

For me it was also a reminder, particularly from some of the parent speakers, of the impact that rare disease has on individuals and families. All too often we are so busy with the day-to-day of research and business that it's easy to lose sight of the ideal end-goal - treatments for all adults, all children, affected by these disparate and often devastating diseases.

Secondly, this month the FDA released new draft guidance “to navigate the difficult road to approval of drugs for rare diseases”.


I attended the Findacure “Drug Repurposing for Rare Diseases” event last week, a small symposium with an interesting mix of attendees – academics, pharma, patient groups, vendors. The main focus was networking, inspired by a series of short talks (see the Findacure blog for more information).

  • 6,000 to 8,000 identified rare diseases (prevalence less than 5 in 10,000)
  • Only approximately 200 have licensed treatments – large unmet need
  • 1 in 17 people (6-8% of population) will develop a rare disease
  • 30-40 million people in US, 30-40 million in Europe
  • 75% of all rare diseases affect children

With the changing landscape from “blockbuster” to more personalised “nichebuster” therapeutics, and the incentives provided by regulatory bodies (such as FDA’s Orphan Drug Designation), rare diseases are an increasing focus of many of Linguamatics’ pharma and biotech customers.

So, I hear you ask – how does text analytics fit into rare disease drug discovery? It’s simple: information associated with rare diseases is essential at many stages of drug discovery and development, and this essential information is often buried in unstructured text, spread across different data sources with differing formats and vocabularies.


Giving a presentation on NLP text mining a couple of weeks ago*, I was asked whether our text analytics solution can help with one of the extra Vs of big data – Veracity. This is a much-discussed topic at the moment and, after Volume, Velocity and Variety, seems to be the most important of the additional Vs (see Seth Grimes’ blog for a good discussion of some more “wanna-Vs”).

Veracity, when it comes to data and decision making, can mean many things:

  • Does my conclusion make sense?
  • Is this particular data point accurate?
  • Do I trust this publication?
  • Is this assertion evidenced reliably?

 - but the bottom line is, if I am making an important business decision, how can I be sure it’s made using the best possible data?

This is obviously a tricky question, and it has been thrown into public view over recent years by studies that tried to replicate critical experimental data and found reproducibility frighteningly low (e.g. PLoS). So, how can a text analytics tool shed any light in such a minefield?

Scientists in the United States spend $28 billion each year on basic biomedical research that cannot be repeated successfully. That is the conclusion of a study published on 9 June 2015 in PLoS Biology that attempts to quantify the causes, and costs, of irreproducibility.


Over the past few months there have been several publications that used Linguamatics I2E to extract key information and provide value in a variety of different projects. We are constantly amazed by the inventiveness of our users, who apply text analytics across the bench-to-bedside continuum, and these publications are no exception. Using the natural language processing power of I2E, researchers can answer their questions rapidly and extract the results they need, with high precision and good recall. A more standard keyword search, by contrast, returns a set of documents that they then need to read.
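To make the contrast between keyword search and extraction concrete, here is a minimal Python sketch. It is not I2E and does not use any Linguamatics API: the toy corpus, the function names and the single regular expression are all invented for illustration, and the regular expression is only a crude stand-in for real natural language processing. The point is the shape of the output: the keyword search returns documents to read, while the extraction step returns structured facts.

```python
import re

# Toy corpus; these sentences are invented for illustration only.
DOCS = {
    "doc1": "Kinase A phosphorylates protein B at Ser10. Protein B is widely studied.",
    "doc2": "We review phosphorylation in general and mention protein B in passing.",
    "doc3": "Kinase C phosphorylates protein D at Thr42.",
}


def keyword_search(docs, terms):
    """Return the ids of documents that contain every term (case-insensitive)."""
    return sorted(
        doc_id
        for doc_id, text in docs.items()
        if all(term.lower() in text.lower() for term in terms)
    )


# Hypothetical pattern: "<enzyme> phosphorylates <substrate> at <site>".
# A real NLP system would handle synonyms, word order and context;
# this regex only matches the one phrasing used in the toy corpus.
PATTERN = re.compile(
    r"(?P<enzyme>Kinase \w+) phosphorylates "
    r"(?P<substrate>protein \w+) at (?P<site>[A-Z][a-z]{2}\d+)"
)


def extract_relations(docs):
    """Return (doc_id, enzyme, substrate, site) rows for every pattern match."""
    rows = []
    for doc_id, text in sorted(docs.items()):
        for match in PATTERN.finditer(text):
            rows.append(
                (doc_id, match["enzyme"], match["substrate"], match["site"])
            )
    return rows


if __name__ == "__main__":
    # Keyword search: a list of documents that still have to be read.
    print(keyword_search(DOCS, ["phosphorylat", "protein B"]))
    # Extraction: structured rows that can go straight into a table.
    for row in extract_relations(DOCS):
        print(row)
```

In this toy case the keyword search also returns the review-style document that only mentions the terms in passing, whereas the extraction step keeps only the sentences that actually assert a modification.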

Let’s start with Hornbeck et al., “PhosphoSitePlus, 2014: mutations, PTMs and recalibrations”. PhosphoSitePlus is an online systems biology resource for the study of protein post-translational modifications (PTMs), including phosphorylation, ubiquitination, acetylation and methylation. It’s provided by Cell Signaling Technology, who have been users of I2E for several years. In the paper, they describe the value of integrating data on protein modifications from high-throughput mass spectrometry studies with high-quality data from manual curation of the published low-throughput (LTP) scientific literature.


Better access to the high value information in legacy safety reports has been, for many folk in pharma safety assessment, a “holy grail”. Locked away in these historical data are answers to questions such as:  Has this particular organ toxicity been seen before? In what species, and with what chemistry? Could new biomarker or imaging studies predict the toxicity earlier? What compounds could be leveraged to help build capabilities?


I2E enables extraction and integration of historical preclinical safety information, crucial to optimizing investment in R&D, alleviating concerns where preclinical observations may not be human-relevant, and reducing late stage failures.

Coming as I do from a decade of working in data informatics for safety/tox prediction, I was excited by one of the talks at the recent Linguamatics Spring User conference. Wendy Cornell (ex-Merck) presented on an ambitious project to use the Linguamatics text mining platform, I2E, in a workflow to extract high-value information from safety assessment reports stored in Documentum.

Access to historic safety data is a potential advantage, and one that will be helped by the use of standards in electronic data submission for regulatory studies (e.g. CDISC’s SEND, the standard for exchange of non-clinical data).