Natural Language Processing Enhances Next-Generation Sequencing Data Analysis
Linguamatics
Exome Sequencing in Rare Disease Research
Exome sequencing has become a common tool in rare genetic disease research. The starting point is usually a family in which several members share the symptoms of an uncharacterized disease that presumably has a genetic cause. Once the exomes of a few affected individuals and their healthy parents have been sequenced, the data can be analyzed to find the variant responsible for the disease phenotype. Many robust analysis tools and pipelines have been developed and adopted over the last decade or so. A typical analysis includes quality-control filtering, alignment, and variant calling, which yields a list of candidate variants, anywhere from tens to many hundreds, that are then filtered to keep only the relevant ones.
Filtering Candidate Variants Demands Evidence
The next step, unravelling any significant biological associations for these gene variants, can be challenging.
Many approaches and criteria have been applied to variant filtering, and different tools and software packages implement them. For example, variants whose genotypes are inconsistent with the disease phenotype across samples are excluded, as are those that represent normal variability in the population regardless of disease (e.g. common SNPs). Other criteria include a predicted significant alteration in protein structure or function based on the sequence change, or actual evidence of clinical effects (as provided by PolyPhen and ClinVar, respectively).
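The filtering criteria described above can be illustrated with a minimal Python sketch. It is not the method of any particular tool: the `Variant` record, its field names, the recessive inheritance model, and the thresholds are all assumptions chosen for illustration, and a real pipeline would read these values from annotated variant files.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Variant:
    # Hypothetical record; field names are illustrative, not from any tool.
    gene: str
    genotypes: Dict[str, str]            # sample ID -> genotype string, e.g. "0/1"
    population_af: float                 # allele frequency in a reference population
    polyphen_score: Optional[float]      # 0.0 (benign) to 1.0 (damaging); None if unscored
    clinvar_significance: Optional[str]  # e.g. "Pathogenic"; None if no record


def is_hom_alt(genotype: str) -> bool:
    """True if both alleles are the first alternate allele (simplified: biallelic sites only)."""
    alleles = genotype.replace("|", "/").split("/")
    return len(alleles) == 2 and all(a == "1" for a in alleles)


def segregates_recessive(v: Variant, affected: List[str], unaffected: List[str]) -> bool:
    """Assumed recessive model: every affected sample is homozygous for the
    variant and no unaffected sample is. Missing genotypes count as non-carriers."""
    return (all(is_hom_alt(v.genotypes.get(s, "./.")) for s in affected)
            and not any(is_hom_alt(v.genotypes.get(s, "./.")) for s in unaffected))


def keep_variant(v: Variant, affected: List[str], unaffected: List[str],
                 max_af: float = 0.01, min_polyphen: float = 0.9) -> bool:
    """Apply the criteria in order. Thresholds are illustrative placeholders."""
    if not segregates_recessive(v, affected, unaffected):
        return False                     # genotype inconsistent with phenotype
    if v.population_af > max_af:
        return False                     # normal population variability
    predicted_damaging = v.polyphen_score is not None and v.polyphen_score >= min_polyphen
    clinical_evidence = v.clinvar_significance in {"Pathogenic", "Likely pathogenic"}
    return predicted_damaging or clinical_evidence


# Made-up example family: two affected children and their healthy parents.
affected = ["child1", "child2"]
unaffected = ["mother", "father"]
fits = {"child1": "1/1", "child2": "1/1", "mother": "0/1", "father": "0/1"}
mismatch = {"child1": "1/1", "child2": "0/1", "mother": "0/1", "father": "0/1"}

candidates = [
    Variant("GENE_A", fits, 0.0001, 0.98, None),          # rare, predicted damaging
    Variant("GENE_B", mismatch, 0.0001, 0.99, None),      # fails segregation
    Variant("GENE_C", fits, 0.20, 0.99, None),            # common in the population
    Variant("GENE_D", fits, 0.0001, 0.10, "Pathogenic"),  # clinical evidence only
    Variant("GENE_E", fits, 0.0001, 0.10, None),          # no supporting evidence
]

kept = [v.gene for v in candidates if keep_variant(v, affected, unaffected)]
print(kept)
```

In this sketch the segregation and frequency checks act as hard exclusions, while predicted impact and clinical evidence are alternatives: either one is enough to keep a variant. Other inheritance models (dominant, de novo) would need a different segregation rule.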