
Text analytics uncovers genotype-phenotype associations in a multiple sclerosis biomarker discovery project at Sanofi


A key requirement in drug development – and increasingly in precision/personalized medicine and pharmacogenomics – is a comprehensive understanding of the genetic associations for the disease of interest. For a multiple sclerosis (MS) biomarker discovery project, Sanofi wanted to annotate the association of human leukocyte antigen (HLA) alleles and haplotypes with diseases and drug hypersensitivity, as the HLA genotype is responsible for some 30% of the risk of MS and participates in almost every aspect of the disease.

HLA alleles have been associated with multiple autoimmune diseases, various types of cancer, infectious diseases, and adverse drug events, but no known resource systematically annotates these associations.

Developing a Comprehensive Catalog of Disease Annotations using Natural Language Processing (NLP)-based Text Analytics

Sanofi identified more than 400 HLA alleles through a whole exome sequencing-based HLA typing and analysis workflow. These potential candidate biomarkers were not annotated in any database. Sanofi then used the Linguamatics NLP platform to analyse and search the literature to annotate the association of the identified HLA alleles with diseases and drug hypersensitivity.

Using I2E text analytics, Sanofi linguistically processed and indexed a literature corpus of 25 million PubMed abstracts and 4 million full-text journal articles. The queries combined an internally developed HLA gene ontology with Linguamatics I2E’s dictionary of relationship verbs (e.g. causes, leads to, results in) and Diseases ontology. This identified HLA alleles and haplotypes and their relationships with diseases and drug sensitivity.
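The general idea of combining an allele vocabulary, a list of relationship verbs, and a disease vocabulary can be illustrated with a minimal Python sketch. This is not I2E's query language or API, and it makes no claim about how Sanofi's queries were written. The regular expression, the short verb and disease lists, and the function name `extract_associations` are all hypothetical stand-ins for the ontologies described above.

```python
import re

# Hypothetical mini-vocabularies, for illustration only. The project itself
# used a custom HLA ontology, a relationship-verb dictionary, and a Diseases
# ontology inside I2E; none of that is reproduced here.
HLA_ALLELE = re.compile(r"HLA-[A-Z]+[0-9]*\*\d{2}(?::\d{2,3})*")
RELATION_VERBS = ["is associated with", "causes", "leads to", "results in"]
DISEASES = ["multiple sclerosis", "narcolepsy", "celiac disease"]


def extract_associations(sentence):
    """Return (allele, verb, disease) triples where a relationship verb
    appears after an allele mention and a disease appears after the verb."""
    lowered = sentence.lower()
    triples = []
    for allele_match in HLA_ALLELE.finditer(sentence):
        for verb in RELATION_VERBS:
            verb_pos = lowered.find(verb, allele_match.end())
            if verb_pos == -1:
                continue  # this verb does not follow the allele mention
            for disease in DISEASES:
                if lowered.find(disease, verb_pos + len(verb)) != -1:
                    triples.append((allele_match.group(), verb, disease))
    return triples


# Illustrative sentence, not taken from the Sanofi corpus.
example = "HLA-DRB1*15:01 is associated with multiple sclerosis in Europeans."
print(extract_associations(example))
```

A production text-mining system would also handle synonyms, negation, and sentence structure, which simple string matching like this ignores.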

Uncovering New Disease and Drug Sensitivity Associations

The Linguamatics I2E text mining query identified all 22 previously published autoimmune diseases associated with HLA alleles and uncovered an additional 34 previously unpublished disease and drug sensitivity associations. These known and novel associations were stored in a database that can be searched by HLA allele or disease through a simple web interface. Experts curate the results, and the curated annotations are saved back into the knowledge base for wider use by other researchers within the Sanofi team in its search for novel biomarkers.


The discovery of an additional 33 novel, unpublished associations of HLA alleles with diseases and drug sensitivity gave Sanofi a broader and more comprehensive knowledge base from which it can now confidently explore potential new biomarkers for multiple sclerosis.

To learn more about how Sanofi used I2E to uncover novel genotype-phenotype associations in its search for new biomarkers, and its other I2E applications from bench to bedside:

Download the full case study

