NLP for FAIRification of unstructured data
Data is the lifeblood of all research, and pharmaceutical research is no exception. Clean, accurate, integrated datasets promise to provide the substrate for artificial intelligence (AI) and machine learning (ML) models that can assist with better drug discovery and development. Using data in the most effective and efficient way is critical, and improving scientific data management and stewardship through the FAIR (findable, accessible, interoperable, reusable) principles will make pharma R&D more efficient and effective.
What is FAIR and how can NLP contribute?
We are seeing applications of natural language processing (NLP) in the FAIRification of unstructured data. Around 80% of the information needed for pharma decision-making is in unstructured formats, from scientific literature and safety reports to clinical trial protocols, regulatory dossiers, and more.
NLP can contribute to the FAIRification of these data in a number of ways. NLP enables effective use of ontologies, a key component of data interoperability. Ensuring that life science entities are referred to consistently across different data sources enables those data to be integrated and made accessible for machine operations. Using a combination of strategies (e.g. ontologies, linguistic rules), NLP can transform unstructured data into formal representations, whether individual concept identifiers or relationships expressed as RDF triples.
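To make the idea concrete, here is a minimal sketch of ontology-based normalization: text mentions of life-science entities are mapped to canonical concept identifiers, and an extracted relationship is emitted as an RDF triple. The mini-ontology, concept IDs, and `example.org` URIs are illustrative assumptions, not a real vocabulary or Linguamatics' implementation; a production pipeline would use full ontologies plus linguistic rules for matching.

```python
# Hypothetical synonym-to-concept lookup (a real pipeline would use a full
# ontology such as MedDRA or ChEBI, with far richer matching logic).
ONTOLOGY = {
    "heart attack": "EX_0001",
    "myocardial infarction": "EX_0001",
    "aspirin": "EX_0002",
    "acetylsalicylic acid": "EX_0002",
}

def normalize(mention):
    """Return the canonical concept ID for a text mention, if known."""
    return ONTOLOGY.get(mention.lower())

def to_triple(drug, relation, event):
    """Express an extracted drug-event relationship as one N-Triples line."""
    s, o = normalize(drug), normalize(event)
    if s is None or o is None:
        return None  # unmapped mention: leave for curation, don't guess
    return (f"<http://example.org/{s}> "
            f"<http://example.org/{relation}> "
            f"<http://example.org/{o}> .")

# Different surface forms resolve to the same concept IDs, so the two
# triples below are identical, and therefore trivially integrable.
print(to_triple("Aspirin", "associated_with", "heart attack"))
print(to_triple("acetylsalicylic acid", "associated_with",
                "myocardial infarction"))
```

Because both surface forms normalize to the same identifiers, data extracted from different sources can be merged and queried by machines without further reconciliation.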
In addition, NLP can make legacy documents in data silos more accessible and re-usable, creating rich indexes and enabling an effective access layer on top of data sources. This allows users to make ad-hoc queries directly using concepts and relationships, not just keywords.
The case studies below give a couple of examples of where organizations are using Linguamatics NLP to FAIRify their unstructured data to create more value from these otherwise hard-to-access sources.
FDA: ontologies for adverse events, and FAIRification for safety signal detection
The FDA uses Linguamatics NLP as one of its suite of data mining tools. Keith Burkhardt from the FDA has talked about using Linguamatics NLP for MedDRA mappings, from FDA drug labels and other sources, to enable comparisons across the literature, FDA drug labels, and FAERS reports. Without a consistent set of identifiers, analysis of potential safety signals is a laborious manual process; using NLP to standardize adverse event concepts to MedDRA enables signal detection and assessment of AE novelty.
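The value of standardization is easy to see in miniature. The sketch below (an illustration, not the FDA's actual pipeline) maps free-text adverse event mentions from different sources to MedDRA-style preferred terms, after which cross-source comparison and novelty assessment reduce to simple set operations. The term mappings and source lists are made up for illustration.

```python
# Hypothetical free-text-to-preferred-term mapping (a real system would use
# the licensed MedDRA terminology and NLP-based matching, not exact lookup).
MEDDRA_MAP = {
    "liver injury": "Hepatotoxicity",
    "hepatotoxicity": "Hepatotoxicity",
    "drug-induced liver damage": "Hepatotoxicity",
    "rash": "Rash",
    "skin eruption": "Rash",
}

def standardize(mentions):
    """Map free-text AE mentions to a set of preferred terms."""
    return {MEDDRA_MAP[m.lower()] for m in mentions if m.lower() in MEDDRA_MAP}

# AEs extracted from three sources, each using different surface forms.
label_aes = standardize(["Rash"])
faers_aes = standardize(["drug-induced liver damage", "skin eruption"])
literature_aes = standardize(["liver injury"])

# An AE reported in FAERS or the literature but absent from the drug label
# is a candidate novel signal worth manual assessment.
novel = (faers_aes | literature_aes) - label_aes
print(sorted(novel))  # → ['Hepatotoxicity']
```

Without the shared identifiers, an analyst would have to recognize by hand that "liver injury" and "drug-induced liver damage" describe the same event, which is exactly the laborious manual step the FDA's approach removes.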
Merck MSD: NLP workflow to access and re-use legacy safety data
Many pharma organizations use document repositories to store valuable legacy reports and files. However, the search functionality of many of these document management systems is limited, and the data are siloed. Merck has established an automated NLP workflow to extract information from the conclusion sections of final study reports stored in a Documentum-based repository. The final results are loaded into the SALAR knowledge base and visualized via dashboards for the safety assessment teams.
This NLP workflow enables better access to and re-use of these data, benefiting the assessment of safety and selectivity of lead compounds currently in development. Early published work on this application of NLP stated that the approach demonstrated that "clear and significant added value and new information can surface that provide vital support to clinical trial safety".
So, as organizations embrace digital transformation and move towards better data access and use, applying the FAIR principles will provide a clear path to value. AI technologies such as NLP can provide some of the tools needed to implement these principles, making our data more findable, accessible, interoperable, and reusable, and ultimately providing more value for R&D decision-making in the process.
Contact us if you want to learn more.