
Standing on the shoulders of giants: NLP for effective literature landscapes


Most research stands on the shoulders of giants, building on the work of others to inform current research and decisions. However, the vast and ever-expanding scale of scientific and biomedical literature makes it challenging to comprehensively assess the state of knowledge on a particular topic. Within pharma and healthcare, scientists and clinicians are looking to technological innovations to provide effective strategies for reviewing literature. Using Natural Language Processing (NLP) for systematic or targeted literature reviews is not a novel idea, but it is becoming more common as the methodology gains wider acceptance. NLP transforms unstructured text in documents and databases into normalized, structured data suitable for review, analysis, visualization, or machine learning models. Pharma and healthcare organizations have used IQVIA NLP (formerly Linguamatics) to search literature for use cases from bench to bedside: target discovery, biomarkers, safety, clinical outcomes, medical affairs, and even assisting with patient treatment.

I want to share three recent papers that use IQVIA NLP as part of innovative methodological approaches to gain a deeper understanding of the current and historical science around particular topics, ranging from real-world effectiveness data to target discovery.

Real world data for drug dosing and clinical outcomes in obesity 

Jamieson et al. (2022) from Pfizer set out to map the evidence around apixaban use in obese patients, particularly the influence of extreme body weight on the pharmacokinetics (PK), pharmacodynamics (PD), efficacy, and safety of this direct oral anticoagulant (DOAC). Their aim was to better understand the real-world effectiveness of apixaban in obese patients with or without associated comorbidities (e.g. nonvalvular atrial fibrillation). IQVIA NLP was used to search PubMed for relevant publications, with a search strategy that combined three criteria: a mention of apixaban or related drugs; some form of obesity; and an indication approved in apixaban’s label. NLP enabled the use of extensive synonyms for each criterion, relationships in which the drug improves a disease (rather than causing an adverse event, for example), and context about the study population. This methodology allowed the authors to “comprehensively review” the available literature and provided an optimal substrate for the final manual review and synthesis. The authors conclude that obesity does not substantially influence the efficacy, effectiveness, or safety of apixaban in these patients. This conclusion supports approved US and EU labeling and indicates that dose adjustment in high-body-weight patients (as proposed by some earlier consensus guidelines) is not required.
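To make the three-criteria search strategy concrete, here is a minimal Python sketch. It is not IQVIA NLP and not the authors' actual query: the synonym lists and the names `SYNONYMS`, `match_concepts`, and `meets_all_criteria` are hypothetical, invented for illustration, and plain keyword matching stands in for the ontology-driven, relationship-aware matching described above.

```python
import re

# Hypothetical, deliberately tiny synonym lists (assumptions for illustration).
# A production ontology would hold far more terms per concept.
SYNONYMS = {
    "drug": ["apixaban", "Eliquis", "direct oral anticoagulant", "DOAC"],
    "obesity": ["obese", "obesity", "extreme body weight"],
    "indication": ["nonvalvular atrial fibrillation", "NVAF"],
}


def compile_concept(terms):
    """Build one case-insensitive, whole-word pattern for a list of synonyms."""
    # Longest terms first so multi-word synonyms win over shorter overlaps.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


PATTERNS = {name: compile_concept(terms) for name, terms in SYNONYMS.items()}


def match_concepts(text):
    """Return, per concept, the sorted lower-cased synonyms found in the text."""
    return {
        name: sorted({m.group(0).lower() for m in pattern.finditer(text)})
        for name, pattern in PATTERNS.items()
    }


def meets_all_criteria(text):
    """True only if the text mentions a drug, an obesity term, and an indication."""
    return all(match_concepts(text).values())
```

Calling `meets_all_criteria` on each abstract keeps only those that satisfy all three criteria. A keyword filter like this cannot tell a drug treating a disease from a drug causing an adverse event; that distinction is what the relationship extraction in the study's NLP query adds.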

Discovery of novel druggable targets using NLP and machine learning 

Han et al. (2022) from Sanofi, in their paper “Empowering the discovery of novel target-disease associations via machine learning approaches in the open targets platform”, use machine learning models and additional data integrated with the Open Targets Platform to reveal new druggable “target to disease” associations. The aim was to derive strong target-indication hypotheses from the models. The authors synthesized data from a number of additional sources (e.g. Gene Ontology annotations) and combined these new data features with the Open Targets data in three machine learning models. To validate the best target-disease associations from the ML models, they turned to NLP and literature search. They extracted the landscape of gene-disease associations from MEDLINE abstracts using IQVIA NLP, with a query that looked for druggable genes with no currently approved indications, together with MeSH diseases. The resulting normalized, structured output provided effective validation for the machine learning results. Using this workflow, the authors generated over 1200 target-indication combinations supported by both ML and NLP outputs, which have the potential to form the basis for drug discovery programs.
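The validation step, keeping only associations supported by both the ML models and the literature, can be sketched as a simple intersection. This is an illustrative Python sketch under stated assumptions, not the authors' pipeline: the function `supported_by_both`, the `Association` type, the score threshold of 0.8, and the normalization rules are all hypothetical.

```python
from typing import NamedTuple


class Association(NamedTuple):
    """A normalized gene-disease pair (hypothetical structure for this sketch)."""
    gene: str
    disease: str


def normalize(gene, disease):
    """Normalize casing and whitespace so ML and NLP outputs can be compared."""
    return Association(gene.strip().upper(), disease.strip().lower())


def supported_by_both(ml_scores, nlp_pairs, threshold=0.8):
    """Keep ML predictions at or above the threshold that NLP also extracted.

    ml_scores: dict mapping (gene, disease) tuples to a model score.
    nlp_pairs: iterable of (gene, disease) tuples extracted from literature.
    threshold: assumed score cut-off; the paper's actual cut-off is not used here.
    Returns (Association, score) tuples sorted by descending score.
    """
    nlp_set = {normalize(gene, disease) for gene, disease in nlp_pairs}
    hits = []
    for (gene, disease), score in ml_scores.items():
        key = normalize(gene, disease)
        if score >= threshold and key in nlp_set:
            hits.append((key, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

Normalizing both sides before comparing matters because the two sources rarely agree on casing or spacing; in practice this role is played by the normalized identifiers that NLP output provides (for example, MeSH disease terms).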

Assessing validation studies for genome-wide association studies (GWAS) variants 

In my third example, Alsheikh et al. (2022) from AbbVie use a comprehensive approach of NLP-based text mining and manual curation to understand experimental validation of GWAS findings, in their recent paper, “The landscape of GWAS validation; systematic review identifying 309 validated non-coding variants across 130 human diseases”. GWAS are used to identify genes associated with a particular disease or trait. They examine the whole genome of a group of people, searching for variations that occur more frequently in people with a certain disease than in people without it.

GWAS have been used for over 15 years, and the authors wanted to review the literature for studies that validated, in the lab, the variants that GWAS had identified. However, there is now a large body of research in MEDLINE, and the authors found over 36,000 relevant papers, too many to review manually. They state, “As a traditional keyword-based search approach would not enable us to thoroughly search for all relevant concepts and combinations, we leveraged natural language processing (NLP) and ontology-based text mining to ensure a systematic identification of relevant validation articles”. This approach allowed them to automatically filter the set of potentially relevant papers to a more manageable corpus of 1454 articles for manual review. From the comprehensive review, they identified over 300 validated GWAS variants, regulating 252 genes across 130 human disease traits. These results underpin the potential for GWAS findings to translate to disease mechanisms and hence novel treatments.

These papers add to the body of work that uses IQVIA NLP to transform unstructured text into structured output for effective review and decision support. NLP enables you to handle huge volumes of text using a suite of tools (ontologies, linguistic patterns, chemical recognition, regex and more) and to unlock rich scientific and clinical content. The papers reviewed above demonstrate that NLP is a key tool for literature research, allowing users to gain a comprehensive and systematic view of what has already been published and to reach new conclusions.

NLP is used across many textual data sources beyond published literature, and the Linguamatics OnDemand content store gives users easy access to a suite of key life science sources, all ready to text-mine, including MEDLINE and PubMed Central, FDA drug labels, patents, preprints, Gene Expression Omnibus, OMIM and more. To learn more, watch our Content Store webinar, or contact us for more information.
