Text Analytics for Systematic Coverage of Knock-Out Mice Models of Autoimmune Disease

December 14 2015

Animal models are crucial in the understanding of disease, the underlying pathways and the gene targets that play a role. One tool that has shown great value is the knockout mouse model.

The number of KO mouse models has increased massively since the first one in 1989, and mouse models have been used successfully to increase our understanding of diseases and conditions as varied as cancers, diabetes, obesity, blindness, Huntington's disease, aggressive behaviour, and even drug addiction.

Understanding the landscape of KO mouse models for any particular disease area is important, and curated databases (e.g. IMPC or MGI) provide valuable data, but keeping track of new KO mouse models published in the scientific literature is challenging.

Peng Zhang, Senior Staff Scientist at Regeneron Pharmaceuticals, uses Linguamatics I2E to tackle this challenge, and he presented on “Text Mining for Knockout Mice and Phenotypes” earlier this year.

Results: Diagram showing the set of KO genes involved in autoimmune phenotypes. All hits from both I2E and MGI were manually curated, and only 479 unique KO genes were considered “true positive”. Of these true positives, 61% came only from the I2E query and were not covered by MGI.

 

Dr. Zhang uses I2E to systematically mine the scientific literature for any reported gene knockout in mice with the associated autoimmune phenotype. This process feeds into Regeneron’s proprietary VelociGene® technology, which allows rapid and precise manipulation of large pieces of DNA for engineering the mouse genome.

Searching for autoimmune phenotypes requires broad vocabularies; the search drew on more than 200 disease classes and subtypes from the MeSH and NCI Thesaurus disease vocabularies provided within the I2E platform. In addition, Dr. Zhang used the flexibility of I2E to include nearly 200 bespoke keywords in his queries. More complex was the strategy for capturing gene knockouts, as descriptions of KO in the scientific literature are very heterogeneous (see Figure 1, below).

Query strategies were developed to cope with false negatives, for example with non-standard characters in disease names, or where gene names were directly connected to the KO keywords; and also for false positives, for example where the phenotype was not related to the genotype (“Loss of Roquin induces early death and immune deregulation but not autoimmunity”).
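The two kinds of error described above can be illustrated with a small Python sketch. This is not I2E query syntax, and the article does not describe how the real queries handle these cases. The function names, the dash list, the negation cues and the disease-name examples are all assumptions chosen for illustration.

```python
import re
import unicodedata

# Assumption: map common Unicode dash variants to a plain ASCII hyphen.
DASHES = dict.fromkeys(map(ord, "\u2010\u2011\u2012\u2013\u2014\u2212"), "-")


def normalize(text):
    """Fold dash variants and accents so that spelling variants compare equal."""
    text = text.translate(DASHES)
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def phenotype_negated(sentence, phenotype):
    """Return True if the phenotype directly follows a simple negation cue."""
    # Hypothetical cue list; a real query would need far broader coverage.
    pattern = r"\b(?:but\s+not|not|no|without)\s+" + re.escape(normalize(phenotype))
    return re.search(pattern, normalize(sentence)) is not None


# A disease name with non-standard characters matches its plain-ASCII form.
print(normalize("Guillain\u2013Barr\u00e9 syndrome") == normalize("Guillain-Barre syndrome"))

# The false-positive example quoted above is flagged as negated.
sentence = ("Loss of Roquin induces early death and immune deregulation "
            "but not autoimmunity")
print(phenotype_negated(sentence, "autoimmunity"))
```

Normalizing before matching reduces false negatives from spelling variants, while the negation check removes hits where the phenotype is explicitly ruled out.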

The results were manually curated and analyzed for performance metrics. By combining two different I2E search strategies for identifying knockouts, a high recall rate of more than 90% was achieved. Compared with the gold-standard MGI knockout phenotype database, I2E provided a large number of additional hits (see the Results diagram, above).
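As a minimal sketch of how such metrics can be computed from curated gene sets, the Python below measures the recall of one strategy, the recall of two strategies combined, and the share of true positives missing from a reference database. The gene names and set contents are placeholder toy data, not the study's results, and the function names are hypothetical.

```python
def recall(retrieved, true_positives):
    """Fraction of curated true positives that the retrieved set contains."""
    return len(retrieved & true_positives) / len(true_positives)


def share_missed_by_reference(text_mined, reference, true_positives):
    """Fraction of true positives found by text mining but absent from the reference."""
    return len((text_mined & true_positives) - reference) / len(true_positives)


# Toy data with placeholder gene names (assumption, for illustration only).
curated_true_positives = {"GeneA", "GeneB", "GeneC", "GeneD", "GeneE"}
strategy_1_hits = {"GeneA", "GeneB", "GeneX"}  # GeneX is a false positive
strategy_2_hits = {"GeneB", "GeneC", "GeneD"}
reference_db_hits = {"GeneA", "GeneE"}

# Combining strategies is a set union, which can only raise recall.
combined_hits = strategy_1_hits | strategy_2_hits

print(recall(strategy_1_hits, curated_true_positives))  # 0.4
print(recall(combined_hits, curated_true_positives))    # 0.8
print(share_missed_by_reference(
    combined_hits, reference_db_hits, curated_true_positives))  # 0.6
```

Because the union of two hit sets contains every hit of each strategy, combining strategies never lowers recall, although it can admit more false positives that then need curation.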

Dr. Zhang said: “Linguamatics I2E exhibited superior performance compared with gold standard MGI resource for identification of knockouts associated with autoimmune phenotypes.”

The results can be used to study genes’ biological functions and their therapeutic implications, and they show the value of text analytics for extracting systematic, up-to-date and detailed information on genotype-phenotype associations.

 

Figure 1: Snapshot of the patterns used within the I2E GUI to find and extract terms for knock-out genes and mouse models. Descriptions of KO in scientific literature are very heterogeneous, for example:
IL-2 knockout mouse model …
Ace2(-/-) mice were crossed with ...
mice harbor a loss-of-function mutation in the gene encoding …
MMP-2 null mice exhibit …
CCR2-deficient mice more susceptible …
germline deletion of CD1d exacerbates …
disruption of the C1qa gene …
Lack of SIGIRR/TIR8 aggravates …
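To make the heterogeneity concrete, here is a rough Python sketch that captures several of the phrasings listed in Figure 1 with regular expressions. It is only an approximation for illustration and does not reproduce the I2E patterns in the figure. The gene-symbol pattern, the pattern list and the function name are assumptions, and the "loss-of-function mutation in the gene encoding …" phrasing is deliberately left unhandled.

```python
import re

# Assumption: a gene symbol is alphanumeric, optionally joined by "-" or "/".
GENE = r"\b(?P<gene>[A-Za-z][A-Za-z0-9]*(?:[-/][A-Za-z0-9]+)*)"

# One hypothetical pattern per family of phrasings from Figure 1.
KO_PATTERNS = [
    re.compile(GENE + r"\s+(?:knockout|knock-out|null)\s+(?:mouse|mice)", re.I),
    re.compile(GENE + r"\s*\(-/-\)\s*(?:mouse|mice)", re.I),
    re.compile(GENE + r"-deficient\s+(?:mouse|mice)", re.I),
    re.compile(r"\b(?:deletion|disruption)\s+of\s+(?:the\s+)?" + GENE, re.I),
    re.compile(r"\b(?:lack|loss)\s+of\s+" + GENE, re.I),
]


def find_knockout_genes(text):
    """Return gene symbols captured by any of the knockout patterns."""
    genes = []
    for pattern in KO_PATTERNS:
        for match in pattern.finditer(text):
            genes.append(match.group("gene"))
    return genes


examples = [
    "IL-2 knockout mouse model",
    "Ace2(-/-) mice were crossed with",
    "MMP-2 null mice exhibit",
    "CCR2-deficient mice more susceptible",
    "germline deletion of CD1d exacerbates",
    "disruption of the C1qa gene",
    "Lack of SIGIRR/TIR8 aggravates",
]
for example in examples:
    print(example, "->", find_knockout_genes(example))
```

Patterns this simple also match sentences where the knockout does not produce the phenotype of interest, which is why the hits still need a phenotype check and manual curation.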