I read with interest a recent publication which sheds light on the complex interactions of synapse protein complexes with human disease.
The study (run by the Genes to Cognition neuroscience research programme) combined wet-lab research with bioinformatics and text analytics to uncover genetic associations with these protein complexes in over seventy human brain diseases, including Alzheimer’s disease, schizophrenia and autism spectrum disorders.
The idea was to identify and develop suitable screening assays for synapse proteomes from post-mortem and neurosurgical brain samples, focusing specifically on Membrane-associated guanylate kinase (MAGUK) associated signalling complexes (MASC).
Our CTO, David Milward, was involved in the text analytics work. He used the natural language processing capabilities of the Linguamatics I2E platform to extract gene-mutation-disease associations from PubMed abstracts. The flexibility of I2E enabled an appropriate balance of recall and precision, providing comprehensive results without overloading curators with noise. Queries were built using linguistic patterns so that associations could be discovered between a list of several thousand relevant gene identifiers and appropriate MedDRA disease terms.
The key aim was to provide comprehensive results with suitable accuracy to allow fast curation. These text-mined results were combined with data from Online Mendelian Inheritance in Man (OMIM) on human MASC genes and genetic disease associations.
In total, 143 gene-disease associations were found: 26 both in OMIM and via text mining of PubMed abstracts; 68 in OMIM alone; and 49 via text mining of PubMed alone.
I wanted to dig a little deeper into the data from the paper and the comparison of OMIM and PubMed. Supplementary Table 5 lists the genes that code for MASC proteins and cause inherited diseases as described in the OMIM repository, or that were identified by text mining software as associated with disease. Of the 143 gene-disease associations found (see Figure), only 26 appeared in both sources. This shows the synergistic value of combining data from these two sources, and the need to integrate multiple sources to get the fullest possible picture of any particular gene-disease involvement.
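To make the overlap concrete, here is a small sketch that works through the arithmetic using only the three counts reported above. The variable names and the `share` helper are my own, purely for illustration, and the percentages are simple derivations from those counts rather than figures taken from the paper.

```python
# Counts of MASC gene-disease associations as reported in the post.
both = 26        # found in OMIM and by text mining of PubMed abstracts
omim_only = 68   # found in OMIM alone
text_only = 49   # found by text mining of PubMed alone

total = both + omim_only + text_only   # all distinct associations
omim_total = both + omim_only          # everything OMIM contributed
text_total = both + text_only          # everything text mining contributed


def share(part, whole):
    """Return part/whole as a percentage, rounded to one decimal place.

    Hypothetical helper, introduced here only for illustration.
    """
    return round(100 * part / whole, 1)


summary = {
    "total": total,
    "omim_total": omim_total,
    "text_mining_total": text_total,
    "pct_in_both": share(both, total),
    "pct_lost_without_text_mining": share(text_only, total),
    "pct_lost_without_omim": share(omim_only, total),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

On these numbers, fewer than one in five associations appear in both sources, and relying on OMIM alone would miss roughly a third of the combined set.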