Exome Sequencing in Rare Disease Research

Exome sequencing has become a very common tool in rare genetic disease research. The starting point is usually a family in which several members share the symptoms of an uncharacterized disease that presumably has a genetic cause. Once the exomes of a few affected individuals and their healthy parents are sequenced, the data can be analyzed to find the variant responsible for the disease phenotype. Many robust analysis tools and pipelines have been developed and used over the last decade or so. A typical analysis includes quality-control filtering, alignment and variant calling, which yields a list of candidate variants, anywhere from tens to many hundreds, that are then filtered to keep only the relevant ones.

Filtering Candidate Variants Demands Evidence

The next step, unravelling any significant biological associations for these gene variants, can be challenging.

Many approaches and criteria have been applied to variant filtering, and different tools and software packages implement them. For example, variants whose genotypes are inconsistent with the disease phenotype across samples are excluded, as are those that represent normal variability in the population regardless of disease (e.g. SNPs). Predicted significant alteration of protein structure or function, or documented evidence of clinical effects, is also often used as a criterion (PolyPhen and ClinVar, respectively).
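To make the first two criteria concrete, here is a minimal Python sketch of a filter that drops variants that are common in the population and keeps only those whose genotypes fit the disease pattern in the family. Everything in it is an assumption made for illustration: the `Variant` record and its field names, the 1% allele-frequency cutoff, and the simple recessive model (affected individuals homozygous for the variant, healthy relatives not). It is not how any particular tool or pipeline works.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumed cutoff: variants seen in more than 1% of a reference
# population are treated as normal variability.
MAX_POPULATION_AF = 0.01


@dataclass
class Variant:
    """Hypothetical, simplified variant record (not a real VCF parser)."""
    gene: str
    population_af: float        # allele frequency in a reference population
    genotypes: Dict[str, str]   # sample id -> "0/0", "0/1" or "1/1"


def fits_recessive_model(variant: Variant,
                         affected: List[str],
                         unaffected: List[str]) -> bool:
    """True if all affected samples are homozygous for the variant
    and no unaffected sample is."""
    affected_ok = all(variant.genotypes.get(s) == "1/1" for s in affected)
    unaffected_ok = all(variant.genotypes.get(s) != "1/1" for s in unaffected)
    return affected_ok and unaffected_ok


def filter_candidates(variants: List[Variant],
                      affected: List[str],
                      unaffected: List[str],
                      max_af: float = MAX_POPULATION_AF) -> List[Variant]:
    """Keep rare variants whose genotypes are consistent with the phenotype."""
    return [
        v for v in variants
        if v.population_af <= max_af
        and fits_recessive_model(v, affected, unaffected)
    ]
```

In practice the inheritance model would be chosen per family (dominant, recessive, de novo), and the frequency cutoff would depend on how rare the disease is.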


When people think about real-world evidence, they generally think about using these data to address questions around drug effectiveness, or population level safety effects. But there are many applications that “real-world data” can address. If you think of real-world data as any type of information gathered about drugs in non-trial settings, a whole world of possibilities opens up:

  • Social media data can be used to understand how well packaging and formulations are working.
  • Customer call feeds can be analyzed for trends in drug switching, off-label use, or contra-indicated medications among concomitant drugs.
  • Full text literature can be mined for information about epidemiology, disease prevalence, and more.

Text Mining transforms Real-World Data to Real-World Evidence

Many of these real-world sources have unstructured text fields, and this is where text analytics and natural language processing (NLP) can fit in. At Linguamatics, we have customers who are using text analytics to get actionable insight from real-world data, and finding valuable intelligence that can inform commercial business strategies.

In this blog, we will be looking at two different Linguamatics customer use cases, where text mining has been used to transform real-world data to real-world evidence.


Ensuring patient safety is the highest priority for drug companies and prescribers, and obviously for patients themselves, so any steps that can give scientists and clinicians more accurate, well-rounded descriptions of safety data should be welcomed by all parties. AstraZeneca (AZ) wanted to test the hypothesis that adverse reaction (AR) information from patients could effectively supplement information from clinical trials, and a key challenge was assembling comparable data sets. AZ studied the commonly reported adverse reaction “nausea”: it is associated with many drugs, and there is a wealth of documented information, albeit in a variety of formats. It is also often debilitating, so anything to reduce its occurrence would be of value to patients.

Patient-reported Real-World Evidence

AZ worked with the patient-generated health data in the PatientsLikeMe system and looked for records reporting nausea as an adverse reaction. Because the PatientsLikeMe system is very well structured, it was relatively simple to extract a clean nausea AR data set that was amenable to comparison. 

Clinical Trial Events

Adverse reactions observed in clinical trials are included on drug labels, and the data is then listed in the online DailyMed repository maintained by the National Library of Medicine. The FDA offers only guidance on how to submit the data, so the content and formats are highly variable, which complicated creating a well-structured data set to compare with the PatientsLikeMe real-world data.


Tracking and reporting adverse events

In recent years, regulatory authorities such as the FDA and EMA have placed an increased emphasis on drug safety of marketed products, particularly the tracking and reporting of adverse events. Pharmaceutical companies are expected to regularly screen the worldwide scientific literature for potential adverse drug reactions, at least every two weeks. The use of text mining and other tools to streamline the literature review process for pharmacovigilance is more crucial than ever in order to ensure patient safety, without overloading drug safety teams.

Manual review of adverse events is time-consuming

Eric Lewis (Safety Development Leader at GlaxoSmithKline) talked at the Linguamatics Text Mining Summit about the challenges of reviewing medical literature for safety signals. For example, he searched the literature for a sample of just 20 marketed products across a 300-day period. Eric found that there were on average 60 new references per day (with a total of over 11,000 documents). He found that manual review time was 1.2 to 1.6 minutes per abstract. He extrapolated this to a typical pharma company product portfolio of 200 marketed products, and showed that this volume of literature would take over 2,200 hours to review, which is hugely time-consuming.
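The extrapolation can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: it takes the 11,000-document total as a lower bound and assumes the literature volume scales linearly with the number of products, which is our reading of the figures rather than a stated part of the talk. The function names are made up for illustration.

```python
def review_hours(documents: float, minutes_per_abstract: float) -> float:
    """Total manual review time in hours for a set of abstracts."""
    return documents * minutes_per_abstract / 60.0


def extrapolate_hours(sample_documents: float,
                      sample_products: int,
                      portfolio_products: int,
                      minutes_per_abstract: float) -> float:
    """Scale the sample volume to a larger portfolio.

    Assumes document volume grows linearly with the number of products.
    """
    scale = portfolio_products / sample_products
    return review_hours(sample_documents * scale, minutes_per_abstract)


# 11,000 documents for 20 products, scaled to 200 products.
low = extrapolate_hours(11_000, 20, 200, 1.2)   # fastest review speed
high = extrapolate_hours(11_000, 20, 200, 1.6)  # slowest review speed
print(f"Estimated review time: {low:,.0f} to {high:,.0f} hours")
```

At 1.2 minutes per abstract this gives about 2,200 hours, matching the figure quoted above; at 1.6 minutes per abstract the estimate rises to roughly 2,900 hours.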


Understanding drug-drug interactions can improve drug safety

A considerable proportion of adverse drug events are caused by interactions between drugs. With an ageing population, and the associated rise in multiple age-related illnesses, the potential for drug-drug interactions (DDIs) is increasing. One way of alleviating some DDIs is to ensure that potentially interacting drugs are taken a suitable time interval apart. But what is the best interval to recommend?

In a recent seminar, Keith Burkhardt of the FDA described a project using text mining to survey the landscape of information on DDIs from FDA Drug Labels. In particular, the FDA review division wanted to find labelling for drugs where the time separation was stated, in order to prevent potential drug safety events.

Mining Data from FDA Drug Labels: dosing regimens and time separation

The drug classes of interest included bile acid sequestrants and exchange resins (such as cholestyramine, colestipol and colesevelam, all LDL cholesterol lowering drugs), phosphate binders (e.g. sevelamer; used for patients with chronic kidney failure), and chelators (used to treat excessively high levels of lead, iron or copper in the blood; e.g. deferasirox, deferiprone). These drug classes can all alter the bioavailability of other drugs, particularly those with a narrow therapeutic range such as warfarin or antiepileptic drugs.
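To give a flavour of what finding a stated time separation in label text involves, here is a deliberately simple Python sketch based on a regular expression. It is not the method used in the FDA project, and it is far cruder than a real NLP query: the pattern, the function name `find_time_separations` and the sample sentence are all invented for illustration, and the sentence is not quoted from any actual drug label.

```python
import re
from typing import List, Tuple

# Matches phrases such as "4 hours before" or "1 hour after".
TIME_SEPARATION = re.compile(
    r"\b(?P<amount>\d+(?:\.\d+)?)\s*"
    r"(?P<unit>hours?|hrs?|minutes?|mins?)\s+"
    r"(?P<relation>before|after|apart)\b",
    re.IGNORECASE,
)


def find_time_separations(text: str) -> List[Tuple[float, str, str]]:
    """Return (amount, unit, relation) for each time separation found."""
    return [
        (float(m.group("amount")),
         m.group("unit").lower(),
         m.group("relation").lower())
        for m in TIME_SEPARATION.finditer(text)
    ]


# Invented example sentence, not taken from a real label.
sample = ("Administer other oral drugs at least 4 hours before "
          "or 1 hour after the sequestrant.")
for amount, unit, relation in find_time_separations(sample):
    print(amount, unit, relation)
```

A pattern like this misses ranges ("4 to 6 hours"), spelled-out numbers ("four hours") and less regular phrasing, which is where linguistic processing and terminologies are needed rather than plain pattern matching.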