NLP for FAIRification of unstructured data
Data is the lifeblood of all research, and pharmaceutical research is no exception. Clean, accurate, integrated data sets promise to provide the substrate for artificial intelligence (AI) and machine learning (ML) models that can assist with better drug discovery and development. Using data effectively and efficiently is critical, and improving scientific data management and stewardship through the FAIR (findable, accessible, interoperable, reusable) principles will improve pharma efficiency and effectiveness.
What is FAIR and how can NLP contribute?
We are seeing applications of natural language processing (NLP) in the FAIRification of unstructured data. Around 80% of the information needed for pharma decision-making is in unstructured formats, from scientific literature and safety reports to clinical trial protocols, regulatory dossiers, and more.
NLP can contribute to the FAIRification of these data in a number of ways. NLP enables effective use of ontologies, a key component of data interoperability. Ensuring that life science entities are referred to consistently across different data sources enables these data to be integrated and made accessible for machine operations. Using a combination of strategies (e.g., ontologies and linguistic rules), NLP can transform unstructured data into formal representations, whether individual concept identifiers or relationships expressed in a format such as RDF.
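As a rough illustration of how ontology lookup and a linguistic rule can combine, the sketch below maps synonymous mentions to a shared concept identifier and turns a simple sentence into a subject-predicate-object triple of the kind RDF uses. Everything in it is an assumption made for illustration: the `EX:` identifiers are placeholders rather than real ontology IDs, the synonym table stands in for a full ontology, and the single "treats" pattern stands in for a much richer set of linguistic rules.

```python
import re

# Hypothetical mini-ontology: lowercase synonym -> concept identifier.
# The EX: identifiers are placeholders, not IDs from any real ontology.
SYNONYMS = {
    "aspirin": "EX:DRUG_0001",
    "acetylsalicylic acid": "EX:DRUG_0001",
    "headache": "EX:COND_0001",
    "cephalalgia": "EX:COND_0001",
}

# One toy linguistic rule: "<subject> treats <object>" or
# "<subject> is used to treat <object>", with an optional final period.
RELATION_PATTERN = re.compile(
    r"(?P<subj>.+?)\s+(?:treats|is used to treat)\s+(?P<obj>.+?)[.]?$",
    re.IGNORECASE,
)


def normalize(mention):
    """Return the concept identifier for a mention, or None if unknown."""
    return SYNONYMS.get(mention.strip().lower())


def extract_triple(sentence):
    """Return a (subject, predicate, object) triple, or None if no match."""
    match = RELATION_PATTERN.match(sentence.strip())
    if match is None:
        return None
    subj = normalize(match.group("subj"))
    obj = normalize(match.group("obj"))
    if subj is None or obj is None:
        return None
    # EX:treats is a placeholder predicate for this sketch.
    return (subj, "EX:treats", obj)


if __name__ == "__main__":
    for text in (
        "Aspirin treats headache.",
        "Acetylsalicylic acid is used to treat cephalalgia",
    ):
        print(extract_triple(text))
```

Both example sentences yield the same triple even though they share no entity wording, which is the consistency that makes data from different sources integrable. A production pipeline would typically use full IRIs from published ontologies and serialize the triples with an RDF library rather than plain tuples.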