As medicinal chemists strive to fill the pipeline with the best possible novel compounds, they require efficient access to the ever-expanding mass of existing information and knowledge about compounds, targets, and diseases and how they are related. Much of this information is buried in published journal articles, patents, reports, and internal document repositories. Posing chemical compound-, target-, and disease-centered questions to extract and organize the data in order to explore these relationships is laborious, time-consuming, and potentially error-prone. Locating chemical structural information is especially challenging because chemicals in the literature are described by many different names: technical, trivial, proprietary, nonproprietary, generic, or trade names.

Roche pRED decided to address this problem and equip its medicinal chemists with a chemically-aware text mining tool (Artemis) that would remove the need for manual searches and data-wrangling, and present the data in a user- and analytics-friendly environment for further exploration. Daniel Stoffler and Raul Rodriguez-Esteban of Roche presented this work in their talk "ARTEMIS - A Text Mining Tool for Chemists" at the Linguamatics Spring Text Mining Conference in 2017.


Innovative Artificial Intelligence and Machine Learning Technologies Can Improve Pharma R&D, Reduce Costs and Benefit Patients

The Pharma industry is constantly searching for more effective, more efficient tools and technologies to improve the drug discovery process. The statistics are well-known, and make gloomy reading: it takes 10-15 years to develop a new drug, at a cost of up to $1 billion. There is currently vigorous discussion over whether new tools and technologies can significantly impact these metrics. Big data, blockchain, artificial intelligence (AI) and machine learning (ML) are much talked-about as holding the key to digital transformation of drug discovery.

At the Bio-IT World Conference & Expo last month, many of these themes were explored. Across the dozen or so session tracks, there were talks and workshops to share information and best practices on how scientists in biotech, pharma, academic institutes and vendor companies are applying AI and ML to a variety of use cases, such as models for adaptive clinical trials, imaging analytics (e.g. for pathology or clinical sample data), lead design, QSAR, analysing data streams from mobile monitoring devices, and more. Some snippets from the talks include:


Exome Sequencing in Rare Disease Research

Exome sequencing has become a very common tool in research into rare genetic diseases. The starting point is usually a family in which several members share the same symptoms of an uncharacterized disease that presumably has a genetic cause. Once the exomes of a few affected individuals and their healthy parents are sequenced, the data is ready to be analyzed, with the aim of finding the variant responsible for the disease phenotype. Many robust analysis tools and pipelines have been developed and used over the last decade or so. A typical analysis includes quality control filtering, alignment and variant calling, which eventually yields a list of candidate variants, anywhere from tens to many hundreds, which are then filtered to keep only the relevant ones.

Filtering Candidate Variants Demands Evidence

The next step, unravelling any significant biological associations for these gene variants, can be challenging.

Many approaches and criteria have been applied to variant filtering, and different tools and software packages implement them. For example, variants whose genotypes are inconsistent with the disease phenotype across samples are excluded, as are variants that represent normal variability in the population regardless of disease (e.g. SNPs). Other commonly used criteria are a significant predicted alteration in protein structure or function caused by the sequence variation, or actual evidence of clinical effects (PolyPhen and ClinVar respectively).
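The filtering criteria described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the logic of any particular tool: the `Variant` record, the simple recessive inheritance model, the 1% allele-frequency cutoff, and the example data are all hypothetical, and the `predicted_damaging` flag stands in for the output of a PolyPhen-style predictor.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Variant:
    """Hypothetical record for one candidate variant (not a real file format)."""
    variant_id: str
    population_af: float        # allele frequency in a reference population
    genotypes: Dict[str, int]   # sample ID -> number of alternate alleles (0, 1, 2)
    predicted_damaging: bool    # stand-in for a PolyPhen-style prediction


def consistent_recessive(v: Variant, affected: Set[str], unaffected: Set[str]) -> bool:
    """Assumed model: affected samples are homozygous, unaffected samples are not."""
    return (all(v.genotypes.get(s, 0) == 2 for s in affected)
            and all(v.genotypes.get(s, 0) < 2 for s in unaffected))


def filter_candidates(variants: List[Variant], affected: Set[str],
                      unaffected: Set[str], max_af: float = 0.01) -> List[Variant]:
    """Keep variants that segregate with disease, are rare, and look damaging."""
    return [
        v for v in variants
        if consistent_recessive(v, affected, unaffected)  # genotype matches phenotype
        and v.population_af <= max_af                     # not common population variation
        and v.predicted_damaging                          # predicted functional impact
    ]


# Hypothetical family: two affected children and two healthy parents.
AFFECTED = {"child1", "child2"}
UNAFFECTED = {"mother", "father"}
EXAMPLE_VARIANTS = [
    Variant("var_a", 0.0001, {"child1": 2, "child2": 2, "mother": 1, "father": 1}, True),
    Variant("var_b", 0.25,   {"child1": 2, "child2": 2, "mother": 1, "father": 1}, True),
    Variant("var_c", 0.0002, {"child1": 2, "child2": 1, "mother": 1, "father": 1}, True),
    Variant("var_d", 0.0003, {"child1": 2, "child2": 2, "mother": 1, "father": 0}, False),
]

kept = filter_candidates(EXAMPLE_VARIANTS, AFFECTED, UNAFFECTED)
print([v.variant_id for v in kept])
```

In this toy data set only `var_a` passes all three filters: `var_b` is too common, `var_c` does not segregate with the phenotype, and `var_d` is not predicted to be damaging. A real pipeline would also consider other inheritance models and draw its frequencies and predictions from curated resources.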


When people think about real-world evidence, they generally think about using these data to address questions around drug effectiveness, or population level safety effects. But there are many applications that “real-world data” can address. If you think of real-world data as any type of information gathered about drugs in non-trial settings, a whole world of possibilities opens up:

  • Social media data can be used to understand how well packaging and formulations are working.
  • Customer call feeds can be analyzed for trends in drug switching, off-label use, or contra-indicated medications among concomitant drugs.
  • Full text literature can be mined for information about epidemiology, disease prevalence, and more.

Text Mining transforms Real-World Data to Real-World Evidence

Many of these real-world sources have unstructured text fields, and this is where text analytics, and natural language processing (NLP), can fit in. At Linguamatics, we have customers who are using text analytics to get actionable insight from real-world data – and finding valuable intelligence that can inform commercial business strategies.

In this blog, we will be looking at two different Linguamatics customer use cases, where text mining has been used to transform real-world data to real-world evidence.


Ensuring patient safety is the highest priority for drug companies and prescribers – and obviously for patients themselves – so any steps that can give scientists and clinicians more accurate, well-rounded descriptions of safety data should be welcomed by all parties. AstraZeneca (AZ) wanted to test the hypothesis that adverse reaction (AR) information from patients could effectively supplement information from clinical trials, and a key challenge was assembling comparable data sets. AZ studied the commonly reported adverse reaction “nausea”: it is associated with many drugs, and there is a wealth of documented information, albeit in a variety of formats. It is also often debilitating, so anything to reduce its occurrence would be of value to patients.

Patient-reported Real-World Evidence

AZ worked with the patient-generated health data in the PatientsLikeMe system and looked for records reporting nausea as an adverse reaction. Because the PatientsLikeMe system is very well structured, it was relatively simple to extract a clean nausea AR data set that was amenable to comparison. 

Clinical Trial Events

Adverse reactions observed in clinical trials are included on drug labels, and the data is then listed in the online DailyMed repository maintained by the National Library of Medicine. The FDA offers only guidance on how to submit the data, so the content and formats are highly variable, which complicated the creation of a well-structured data set to compare with the PatientsLikeMe real-world data.
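To give a feel for why variable label text is hard to structure, here is a deliberately naive Python sketch that pulls a nausea frequency out of free-text sentences with regular expressions. It is not the method AZ or Linguamatics used; the function name, the two patterns, and the example sentences are all assumptions made for illustration, and real label text would need far more robust NLP.

```python
import re
from typing import Optional

# A percentage such as "12%", "7.5 %" (illustrative pattern only).
_PCT = r"(\d+(?:\.\d+)?)\s*%"

# Two hypothetical phrasings: term first, or percentage first.
_PATTERNS = [
    # "Nausea (12%)" / "nausea was reported in 7.5 % of patients"
    re.compile(r"nausea[^.%\d]*?" + _PCT, re.IGNORECASE),
    # "3% of patients experienced nausea"
    re.compile(_PCT + r"\s+of\s+patients[^.]*?nausea", re.IGNORECASE),
]


def nausea_rate(text: str) -> Optional[float]:
    """Return the first nausea percentage found in text, or None if absent."""
    for pattern in _PATTERNS:
        match = pattern.search(text)
        if match:
            return float(match.group(1))
    return None


# Made-up sentences in the style of label text, not quotes from any label.
examples = [
    "Most common adverse reactions: Nausea (12%), headache (4%).",
    "Nausea was reported in 7.5 % of patients.",
    "In trials, 3% of patients experienced nausea.",
    "Headache (4%) and dizziness were observed.",
]
rates = [nausea_rate(s) for s in examples]
print(rates)
```

Each new phrasing needs another pattern, and even these two patterns would misread a sentence that lists several reactions before a single percentage. That brittleness is the practical reason unstructured label text is harder to compare than an already structured source.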