
NLP Extracts Genotype-Phenotype Associations for Rare Disease
A rare disease is any disease that affects a small percentage of the population. Rare diseases are complex: over 7,000 such conditions are known, 80% are genetic disorders, and 50% of those affected are children. Information about rare diseases is essential at many stages of drug discovery and development, but a fundamental challenge in rare disease research is gaining access to enough relevant data. Patients are difficult to find, and essential information is siloed and often buried in unstructured text across different data sources with differing formats and vocabularies.
Patient registries help address some of these challenges by collecting and organizing patient and other data, and there are numerous rare disease registries. They are valuable for rare disease research because rare disease clinical trials have limited cohort sizes and often do not run long enough for pathologies to manifest. A patient registry gives a more complete longitudinal view than a clinical trial alone.
Hunter Outcome Survey, a patient registry for Hunter Syndrome
Takeda sponsors the Hunter Outcome Survey (HOS) registry. HOS covers patients with the rare disease Hunter Syndrome (also known as Mucopolysaccharidosis type II), which is treated by Takeda’s enzyme replacement therapy, Elaprase. The disease is caused by deficient activity in an enzyme encoded by the IDS (iduronate-2-sulfatase) gene, and relating IDS mutations to patient outcomes is one of the aims of HOS.
The HOS patient registry is a global, multi-center, long-term observational survey that covers 1000 patients in 124 clinics in 29 countries. Detailed data is collected using 100 HOS case report forms, one of which records the results of any genotypic analysis, including mutations. Each data update involves manual curation of incoming data, automated QC, and annotation of the mutation data, including a note on whether the observed mutation has literature precedence.
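As a rough illustration of the literature-precedence annotation step, the check can be thought of as a lookup of each observed mutation against the set of mutations already reported in the literature. The sketch below is an assumption for illustration only, not the actual HOS process: the function name `annotate_precedence`, the simple case-and-whitespace normalization, and the example mutation strings are all hypothetical.

```python
def annotate_precedence(observed, literature):
    """Flag each observed mutation as having literature precedence or not.

    Hypothetical helper: both inputs are iterables of mutation strings.
    Matching ignores case and surrounding whitespace, nothing more.
    """
    known = {m.strip().lower() for m in literature}
    return {m: m.strip().lower() in known for m in observed}


# Illustrative values only, not taken from the registry.
flags = annotate_precedence(
    observed=["c.1403G>A", "c.999C>T"],
    literature=["c.1403G>A"],
)
for mutation, has_precedence in flags.items():
    print(mutation, "literature precedence:", has_precedence)
```

A real pipeline would need stricter normalization of variant nomenclature than this string comparison provides.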
Using NLP to Extract Genotype-Phenotype Associations for HOS
To better understand Hunter Syndrome biology, Takeda wanted to mine Hunter Syndrome papers to identify IDS genetic mutations associated with the syndrome, and to associate related genotypes with phenotypes and disease severity. These data then feed into the Hunter Outcome Survey registry.
Takeda had been scanning the literature manually to extract IDS mutation information to add to HOS. This was labor-intensive, time-consuming, and incomplete, so Takeda switched to Linguamatics natural language processing (NLP) text mining to improve the process and add valuable genotype-phenotype associations to HOS.
Takeda created a corpus of published information on Hunter Syndrome studies, using Linguamatics NLP to search PubMed abstracts for papers that mentioned IDS and Hunter Syndrome. NLP queries were then developed to search the full-text papers for gene variants and mutations, extracting key information from both the text and embedded tables.
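To give a feel for what extracting variants and linking them to phenotype can involve, here is a minimal sketch that uses regular expressions in place of a full NLP engine. It is not the Linguamatics query language or Takeda's actual queries. It assumes variants are written as simple HGVS-style substitutions (such as "c.1403G>A" or "p.Arg468Gln"), treats the words "severe" and "attenuated" as severity terms, and pairs a variant with a severity term when both occur in the same sentence. The function name `extract_associations` and the sample sentences are hypothetical.

```python
import re

# Three-letter amino acid codes, plus "Ter" for a stop codon.
AA3 = ("Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|"
       "Phe|Pro|Ser|Thr|Trp|Tyr|Val|Ter")

# Simple substitutions only: protein level (p.Arg468Gln) or coding DNA (c.1403G>A).
VARIANT = re.compile(rf"\bp\.(?:{AA3})\d+(?:{AA3})\b|\bc\.\d+[ACGT]>[ACGT]\b")

# Assumed severity vocabulary for this sketch.
SEVERITY = re.compile(r"\b(severe|attenuated)\b", re.IGNORECASE)


def extract_associations(text):
    """Return (variant, severity) pairs that co-occur in the same sentence."""
    pairs = []
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        variants = VARIANT.findall(sentence)
        severities = [s.lower() for s in SEVERITY.findall(sentence)]
        for variant in variants:
            for severity in severities:
                pairs.append((variant, severity))
    return pairs


# Illustrative text written for this example, not quoted from any paper.
sample = (
    "The c.1403G>A (p.Arg468Gln) variant was found in a patient with the "
    "severe phenotype. A second patient had no reported variant."
)
print(extract_associations(sample))
```

Sentence-level co-occurrence is a deliberately crude stand-in for linguistic relationship extraction: it misses deletions, insertions, and legacy notations such as "R468Q", and it does not handle negation or content in embedded tables.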
The queries revealed around 400 unique mutations, some previously undetected by manual scanning, together with their relationships to Hunter Syndrome phenotypic information in the papers. These data are fed into the HOS data update process for access by Hunter Syndrome stakeholders worldwide.
NLP supports understanding of Hunter Syndrome
Takeda’s continued support of the HOS patient registry requires ongoing literature scanning for publications on the disease, mutations of the key IDS gene, and associated phenotypic information. The NLP-powered data extraction and annotation process saves considerable manual processing time and provides more comprehensive detection and extraction of genotype-phenotype associations for Hunter Syndrome. These newly extracted genotype-phenotype associations in HOS add to the understanding of Hunter Syndrome biology and help identify risk factors for the severe phenotype.
