Leverage Natural Language Processing to make your data FAIR-er

June 18, 2019

NLP for FAIRification of unstructured data

Data is the lifeblood of all research, and pharmaceutical research is no exception. Clean, accurate, integrated data sets promise to provide the substrate for artificial intelligence (AI) and machine learning (ML) models that can assist with better drug discovery and development. Using data in the most effective and efficient way is critical, and improving scientific data management and stewardship through the FAIR (findable, accessible, interoperable, reusable) principles will improve pharma efficiency and effectiveness.

What is FAIR and how can NLP contribute?

The FAIR principles were first proposed in 2016, and that initial paper triggered not just discussion (see the recent Pistoia review paper) but, in many organizations, action.

We are seeing applications of natural language processing (NLP) in the FAIRification of unstructured data. Around 80% of the information needed for pharma decision-making is in unstructured formats, from scientific literature to safety reports, clinical trial protocols, regulatory dossiers and more.

NLP can contribute to the FAIRification of these data in a number of ways. NLP enables the effective use of ontologies, a key component of data interoperability. Ensuring that life science entities are referred to consistently across different data sources allows those data to be integrated and made accessible for machine operations. Using a combination of strategies (e.g. ontologies, linguistic rules), NLP can transform unstructured data into formal representations, whether individual concept identifiers or relationships expressed as RDF triples.
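To make the idea concrete, here is a minimal sketch of that normalization step: surface mentions are mapped to canonical ontology identifiers, and a relationship between two normalized concepts is emitted as an RDF-style N-Triples line. The tiny synonym table, the identifiers, and the `example.org` namespace are all illustrative placeholders, not real vocabulary entries or any vendor's implementation.

```python
# Toy synonym table: surface form -> canonical ontology ID (illustrative)
ONTOLOGY = {
    "heart attack": "MESH:D009203",
    "myocardial infarction": "MESH:D009203",
    "mi": "MESH:D009203",
    "aspirin": "CHEBI:15365",
    "acetylsalicylic acid": "CHEBI:15365",
}

def normalize(mention):
    """Return the canonical ID for a mention, if the vocabulary knows it."""
    return ONTOLOGY.get(mention.strip().lower())

def to_triple(subject_mention, predicate, object_mention):
    """Build one N-Triples line from two normalized mentions."""
    s = normalize(subject_mention)
    o = normalize(object_mention)
    if s is None or o is None:
        return None  # unmapped mention: leave for manual curation
    base = "http://example.org/id/"  # placeholder namespace
    return f"<{base}{s}> <{base}{predicate}> <{base}{o}> ."

# Different surface forms resolve to the same identifiers:
print(to_triple("Aspirin", "associated_with", "heart attack"))
print(to_triple("acetylsalicylic acid", "associated_with", "MI"))
# both print the same N-Triples line
```

Because both calls normalize to the same concept IDs, the two sentences become the same machine-readable statement, which is exactly what makes the data interoperable.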

In addition, NLP can make legacy documents in data silos more accessible and reusable, creating rich indexes and enabling an effective access layer on top of data sources. This allows users to make ad-hoc queries directly using concepts and relationships, not just keywords.
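The difference between a keyword index and a concept index can be sketched in a few lines. In this hedged example (the vocabulary and documents are invented, and real systems use full ontologies rather than substring matching), the index is keyed by normalized concept IDs, so one query finds documents that use any synonym:

```python
from collections import defaultdict

SYNONYMS = {  # toy vocabulary; a real system would use a full ontology
    "neoplasm": "MESH:D009369",
    "tumor": "MESH:D009369",
    "tumour": "MESH:D009369",
}

def index_documents(docs):
    """Build an inverted index: concept ID -> set of doc IDs mentioning any synonym."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for surface, concept in SYNONYMS.items():
            if surface in text.lower():
                index[concept].add(doc_id)
    return index

docs = {
    "report_1": "A benign tumour was observed.",
    "report_2": "No neoplasm detected.",
    "report_3": "Liver enzymes within normal range.",
}
idx = index_documents(docs)
print(sorted(idx["MESH:D009369"]))  # → ['report_1', 'report_2']
```

A keyword search for "tumor" would have missed both matching reports here; querying by concept ID retrieves them regardless of which synonym or spelling each author used.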

The case studies below give a couple of examples of where organizations are using Linguamatics NLP to FAIRify their unstructured data to create more value from these otherwise hard-to-access sources.

FDA: ontologies for adverse events, and FAIRification for safety signal detection

The FDA uses Linguamatics NLP as one of its suite of data mining tools. Keith Burkhardt from the FDA has talked about using Linguamatics NLP for MedDRA mappings from FDA drug labels and other sources, to enable comparisons across the literature, FDA drug labels, and FAERS reports. Without a consistent set of identifiers, analysis of potential safety signals is a laborious manual process; using NLP to standardize adverse event concepts to MedDRA enables signal detection and assessment of AE novelty.

Fig. 1 Workflow used in the assessment of potential safety signals across disparate data sources. NLP was used to generate the Target-Adverse Event profiles (step 3) across three different sources, mapping each AE to MedDRA to enable comparison. Burkhardt was able to increase AE recall with linguistic strategies (e.g. morphological variants, spelling correction, and matching across conjunctions) and to increase precision by using linguistic context and document regions. Compared to a manual gold standard, Linguamatics NLP gave results with excellent metrics, including an F-score of 0.95.
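One of the linguistic strategies mentioned above, tolerating spelling variants when mapping adverse-event mentions to MedDRA preferred terms, can be illustrated with a small sketch. The term list and codes below are a toy subset for illustration only, and the standard-library `difflib` matcher stands in for a production spelling-correction component:

```python
import difflib

MEDDRA_PT = {  # illustrative subset: preferred term -> code
    "nausea": "10028813",
    "vomiting": "10047700",
    "hepatotoxicity": "10019851",
}

def map_to_meddra(mention, cutoff=0.8):
    """Return (preferred term, code) for a possibly misspelled AE mention."""
    match = difflib.get_close_matches(mention.lower(), MEDDRA_PT, n=1, cutoff=cutoff)
    if not match:
        return None  # no confident match: route to manual review
    term = match[0]
    return term, MEDDRA_PT[term]

print(map_to_meddra("vommiting"))  # → ('vomiting', '10047700')
print(map_to_meddra("nausee"))     # → ('nausea', '10028813')
```

The cutoff parameter trades recall against precision: loosening it catches more variants but risks mapping unrelated terms, which is why context-based precision strategies matter alongside tolerant matching.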

Merck MSD: NLP workflow to access and re-use legacy safety data

Many pharma organizations use document repositories to store valuable legacy reports and files. However, the search functionality of many of these document management systems is limited and the data are siloed. Merck has established an automated NLP workflow to extract information from the conclusion sections of final study reports stored in a Documentum-based repository. Final results are loaded into the SALAR knowledgebase and visualized via dashboards for the safety assessment teams.

This NLP workflow enables better access to, and re-use of, these data, benefiting the assessment of the safety and selectivity of lead compounds currently in development. Early published work on this application of NLP stated that the approach demonstrated that "clear and significant added value and new information can surface that provide vital support to clinical trial safety".

Fig. 2 Schematic showing the use of NLP to make historic safety data more accessible and reusable. 1. Study annotation: ontologies and bespoke Merck vocabularies are used to find and normalize key metadata for each study. 2. Section location: Linguamatics NLP identifies specific regions (summary, conclusions) so that key concepts are found in the correct context, reducing noise. 3. and 4. Normalization of organ-toxicology histopathology findings and hematology/serum biochemistry findings to a unified ontology enables interoperability of the text-mined data with structured results. Overall, the workflow provides a historical summary of findings against which to assess the significance of new findings on pipeline compounds.
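The section-location step in the caption above can be sketched simply: restrict extraction to the conclusions region so that findings are read in the right context. This is an assumption-laden toy (the heading names and report text are invented, and real reports need far more robust structure detection), not Merck's or Linguamatics' implementation:

```python
HEADINGS = {"methods", "results", "conclusions", "discussion"}

def conclusions_section(report):
    """Return only the text under a 'Conclusions' heading, up to the next heading."""
    out, capturing = [], False
    for line in report.splitlines():
        word = line.strip().lower()
        if word in HEADINGS:
            capturing = (word == "conclusions")  # start/stop at headings
            continue
        if capturing:
            out.append(line)
    return "\n".join(out).strip()

report = """Results
ALT elevated at high dose.
Conclusions
No adverse findings at the proposed clinical dose.
Discussion
Further work is planned."""

print(conclusions_section(report))
# → No adverse findings at the proposed clinical dose.
```

Scoping extraction this way is what keeps an incidental mention in the methods or results section from being mistaken for a study conclusion.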

So, as organizations embrace digital transformation and move towards better data access and use, applying the FAIR principles provides a clear path to value. AI technologies such as NLP can provide some of the tools needed to implement these principles, making our data more findable, accessible, interoperable, and reusable, and ultimately delivering more value for R&D decision-making in the process.


Contact us if you want to learn more.