
Natural Language Processing Enhances Next-Generation Sequencing Data Analysis

Exome Sequencing in Rare Disease Research

Exome sequencing has become a very common tool in research on rare genetic diseases. The starting point is usually a family in which several members share the symptoms of an uncharacterized disease that presumably has a genetic cause. Once the exomes of a few affected individuals and their healthy parents are sequenced, the data can be analyzed to find the variant responsible for the disease phenotype. Many robust analysis tools and pipelines have been developed and used over the last decade or so. A typical analysis includes quality-control filtering, alignment, and variant calling, which yields a list of candidate variants, from tens to many hundreds, that are then filtered to keep only the relevant ones.

Filtering Candidate Variants Demands Evidence

The next step, unravelling any significant biological associations for these gene variants, can be challenging.

Many approaches and criteria have been applied to variant filtering, and different tools and software packages implement them. For example, variants whose genotypes are inconsistent with the disease phenotype across samples are excluded, as are variants that represent normal variability in the population regardless of disease (e.g. SNPs). Predicted significant alteration of protein structure or function based on the sequence variation, or actual evidence of clinical effects, is also often used as a criterion (PolyPhen and ClinVar, respectively).
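As a rough illustration of the filtering criteria described above, the following Python sketch applies three checks in sequence: segregation with the phenotype, population frequency, and predicted functional impact. The `Variant` fields, the recessive inheritance model, the 1% frequency cutoff, and all sample and gene names are assumptions made for this example; they are not taken from any particular tool or pipeline.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Variant:
    gene: str                          # placeholder gene label
    alt_allele_counts: Dict[str, int]  # sample ID -> 0, 1 or 2 alternate alleles
    population_af: float               # allele frequency in a reference population
    predicted_damaging: bool           # e.g. derived from a predictor such as PolyPhen


def segregates_recessive(v: Variant, affected: List[str], unaffected: List[str]) -> bool:
    """Assumed model: affected samples are homozygous, unaffected samples are not."""
    return (all(v.alt_allele_counts.get(s, 0) == 2 for s in affected)
            and all(v.alt_allele_counts.get(s, 0) < 2 for s in unaffected))


def filter_candidates(variants: List[Variant], affected: List[str],
                      unaffected: List[str], max_af: float = 0.01) -> List[Variant]:
    """Keep variants that segregate with disease, are rare, and are predicted damaging."""
    return [
        v for v in variants
        if segregates_recessive(v, affected, unaffected)
        and v.population_af <= max_af
        and v.predicted_damaging
    ]


# Invented toy data: two affected children and their healthy parents.
affected = ["child1", "child2"]
unaffected = ["mother", "father"]
carrier_family = {"child1": 2, "child2": 2, "mother": 1, "father": 1}

variants = [
    Variant("GENE_A", carrier_family, 0.0001, True),    # passes every filter
    Variant("GENE_B", {"child1": 2, "child2": 1, "mother": 1, "father": 1}, 0.0001, True),
    Variant("GENE_C", carrier_family, 0.20, True),      # too common in the population
    Variant("GENE_D", carrier_family, 0.0001, False),   # not predicted damaging
]

candidates = filter_candidates(variants, affected, unaffected)
print([v.gene for v in candidates])
```

In practice the inheritance model and thresholds would be chosen per study, and the surviving candidates are the ones that go on to evidence gathering.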

Natural Language Processing Can Reveal Relevant Evidence from Unstructured Literature

Generating and analyzing networks of candidate interacting genes or variants (e.g. genes in the same cellular pathways) provides a valuable approach. The scope of this part of exome analysis is immense and incorporates many different sources of information. So why is text mining, or more specifically Natural Language Processing (NLP)-based search, beneficial in such cases?

First and foremost, analysis of text using NLP can reveal information that is not available in database records, and can even raise new hypotheses. While databases contain user-submitted observations based on scientific findings or calculations, scientific text can describe more than established facts and can also include possible interactions. NLP allows us to capture such relationships, for example indirect or weak interactions between genes or proteins that are part of a suggested disease mechanism. A hypothesis developed in one scientific publication might be highly relevant to other studies, and without text mining it would be missed.

In addition, having a wide range of domain-specific ontologies helps in the accurate identification of gene and protein names and symbols, mutations, and diseases across platforms. These ontologies include synonyms, and when any of the synonyms is found in the right context and the right structure, we get a hit: a high-confidence, relevant, and accurate result. In the context of exome data analysis, a few hits might make a big difference.
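The synonym lookup described above can be sketched in a few lines of Python. This is only a minimal dictionary matcher under stated assumptions: the tiny `ONTOLOGY` mapping and the function names are made up for illustration, and the sketch deliberately ignores the context and structure checks that a real NLP system would apply before counting a match as a hit.

```python
import re

# Hypothetical miniature ontology: preferred symbol -> known synonyms.
ONTOLOGY = {
    "TP53": ["TP53", "p53", "tumor protein p53"],
    "CFTR": ["CFTR", "cystic fibrosis transmembrane conductance regulator"],
}


def build_matcher(ontology):
    """Return a compiled regex plus a lowercase synonym -> concept lookup."""
    lookup = {}
    for concept, synonyms in ontology.items():
        for synonym in synonyms:
            lookup[synonym.lower()] = concept
    # Longest synonyms first, so multi-word names win over their substrings.
    alternatives = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(a) for a in alternatives) + r")\b",
        re.IGNORECASE,
    )
    return pattern, lookup


def find_concepts(text, pattern, lookup):
    """Return (matched text, normalized concept) pairs in order of appearance."""
    return [(m.group(0), lookup[m.group(0).lower()]) for m in pattern.finditer(text)]


pattern, lookup = build_matcher(ONTOLOGY)
sentence = ("Loss of p53 function was reported; the tumor protein p53 "
            "pathway and CFTR were also discussed.")
hits = find_concepts(sentence, pattern, lookup)
print(hits)
```

Normalizing every synonym to one preferred symbol is what lets mentions written in different ways be counted as evidence for the same gene.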

Lastly, text resources are not limited to scientific literature and can include much more: internal medical records, results of genetic testing, analysis reports from sequencing core facilities that analyze thousands of exome datasets on a regular basis, or population-specific genetic repositories. All of these sources can be indexed and searched in a similar way to Medline papers, using the power of NLP to extract more relevant information from your data.

Sanofi Applied Natural Language Processing to Discover New Genetic Biomarkers Associated with Multiple Sclerosis

One recent example of using NLP and text mining in genomics research is a study of genetic biomarkers associated with Multiple Sclerosis (MS) and other autoimmune diseases. As part of a biomarker study, researchers from Sanofi collected DNA and RNA samples from Multiple Sclerosis patients for HLA typing, Whole Exome Sequencing, RNA-seq and Genome-Wide Association Studies. HLA is the most highly polymorphic region in the human genome, and some of its alleles are known to be associated with a higher risk of MS and other autoimmune disorders. Although several hundred alleles were identified in their HLA typing study, these were not annotated in any database.

To annotate these alleles, Sanofi used Linguamatics I2E to create an index of full-text articles and abstracts, using ontologies including a bespoke HLA vocabulary. Sanofi developed a set of I2E queries to identify associations of the alleles with various autoimmune diseases, to identify associations of HLA haplotypes with diseases, and to find alleles that cause drug hypersensitivity. Although dozens of previous studies had been carried out on the association of HLA alleles with autoimmune diseases, many new alleles were annotated using this text-mining approach.
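To give a feel for what an allele-disease association query looks for, here is a minimal Python sketch that counts sentences mentioning both an HLA allele and a disease term. It is not I2E and does not reflect Sanofi's actual queries: the regular expression, the disease list, the naive sentence splitter, and the example sentences are all assumptions written for this illustration.

```python
import re
from collections import Counter

# Assumed pattern for allele names such as HLA-DRB1*15:01 (two to four fields).
HLA_ALLELE = re.compile(r"\bHLA-[A-Z]+\d*\*\d{2,3}(?::\d{2,3}){0,3}\b")

# Hypothetical disease vocabulary; a real one would come from an ontology.
DISEASES = ["multiple sclerosis", "type 1 diabetes"]


def cooccurrences(text):
    """Count (allele, disease) pairs that appear in the same sentence."""
    counts = Counter()
    # Naive sentence split on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        lowered = sentence.lower()
        alleles = set(HLA_ALLELE.findall(sentence))
        diseases = {d for d in DISEASES if d in lowered}
        for allele in alleles:
            for disease in diseases:
                counts[(allele, disease)] += 1
    return counts


# Invented example sentences, used only to exercise the function.
text = ("HLA-DRB1*15:01 has been linked to multiple sclerosis. "
        "HLA-A*02:01 was typed in all samples. "
        "Carriers of HLA-DRB1*15:01 with multiple sclerosis were followed up.")

pairs = cooccurrences(text)
for (allele, disease), n in pairs.items():
    print(allele, disease, n)
```

Sentence-level co-occurrence is a crude stand-in for a linguistic query: it cannot tell a positive association from a negated one, which is one reason context-aware NLP is preferred for this kind of annotation.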

To hear more, please watch our webinar on Text Mining at Sanofi for Genotype-Phenotype Associations in Multiple Sclerosis.

You can also learn more by visiting our gene-disease mapping application area, or by watching the presentation by Dongyu Liu, PhD, Associate Director, Translational Sciences at Sanofi, from the Linguamatics Text Mining Summit 2017.
