NLP for FAIRification of unstructured data

Data is the lifeblood of all research, and pharmaceutical research is no exception. Clean, accurate, integrated data sets promise to provide the substrate for artificial intelligence (AI) and machine learning (ML) models that can assist with better drug discovery and development. Using data in the most effective and efficient way is critical, and improving scientific data management and stewardship through the FAIR (findable, accessible, interoperable, reusable) principles will improve pharma efficiency and effectiveness.

What is FAIR and how can NLP contribute?

The FAIR principles were first proposed in 2016. The initial paper triggered not just discussion (see the recent Pistoia review paper) but, in many organizations, action.

We are seeing applications of natural language processing (NLP) in FAIRification of unstructured data. Around 80% of information needed for pharma decision-making is in unstructured formats, from scientific literature to safety reports, clinical trial protocols to regulatory dossiers and more.

NLP can contribute to FAIRification of these data in a number of ways. NLP enables effective use of ontologies, a key component in data interoperability. Ensuring that life science entities are referred to consistently across different data sources enables these data to be integrated and made accessible for machine operations. Using a combination of strategies (e.g. ontologies, linguistic rules), NLP can transform unstructured data into formal representations, whether individual concept identifiers or relationships expressed in a format such as RDF.
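As a rough illustration of the idea, the sketch below combines a tiny synonym dictionary (standing in for an ontology) with a single linguistic rule to turn free text into concept identifiers and RDF-style triples. It is a minimal sketch using only the Python standard library, not how any particular NLP product works. The identifiers (`EX:DRUG_0001` and so on), the `EX:treats` predicate, the `http://example.org/` base URI and all function names are assumptions made up for this example.

```python
import re

# Hypothetical mini-ontology: each synonym maps to one concept identifier.
# The identifiers are placeholders for illustration, not real ontology IDs.
ONTOLOGY = {
    "aspirin": "EX:DRUG_0001",
    "acetylsalicylic acid": "EX:DRUG_0001",
    "headache": "EX:DISEASE_0001",
    "cephalalgia": "EX:DISEASE_0001",
}

# One toy linguistic rule: "<subject> treats <object>" or
# "<subject> is used to treat <object>", ending at punctuation.
RELATION_PATTERN = re.compile(
    r"(?P<subj>[\w\s]+?)\s+(?:treats|is used to treat)\s+(?P<obj>[\w\s]+?)[.,;]",
    re.IGNORECASE,
)

# Assumed base URI used only to render the placeholder identifiers as URIs.
BASE_URI = "http://example.org/"


def normalize(mention):
    """Map a text mention to its concept identifier, or None if unknown."""
    return ONTOLOGY.get(mention.strip().lower())


def extract_triples(text):
    """Return (subject, predicate, object) triples of concept identifiers."""
    triples = []
    for match in RELATION_PATTERN.finditer(text):
        subj = normalize(match.group("subj"))
        obj = normalize(match.group("obj"))
        if subj and obj:  # keep only mentions the ontology recognises
            triples.append((subj, "EX:treats", obj))
    return triples


def to_ntriples(triples):
    """Render triples as N-Triples-style lines under the assumed base URI."""
    def uri(identifier):
        return "<" + BASE_URI + identifier.replace("EX:", "") + ">"

    return [f"{uri(s)} {uri(p)} {uri(o)} ." for s, p, o in triples]


if __name__ == "__main__":
    sample = (
        "Aspirin treats headache. "
        "Acetylsalicylic acid is used to treat cephalalgia."
    )
    for line in to_ntriples(extract_triples(sample)):
        print(line)
```

Both sentences in the sample use different wording, yet they resolve to the same triple, which is the consistency that makes data from different sources integrable. A real pipeline would replace the dictionary with a curated ontology and the single regular expression with much richer linguistic analysis.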


Precision medicine focuses on disease treatment and prevention, at the clinical level (in healthcare organisations), and within drug discovery and development (in pharma companies). Treatments are developed and delivered, taking into account the variability in genes, environment, and lifestyle between individual patients. Within the clinical arena, in order to understand the best treatment pathway for a particular patient or group of patients, it is important to be able to access and analyze detailed information from the medical records of patients, and ideally broader aspects beyond their medical history.

A great example of precision medicine within the clinical arena was presented at the Linguamatics seminar in Chicago in March 2019. At the University of Iowa, scientists at the Stead Family Children’s Hospital are working on a precision medicine research project. Alyssa Hahn (Graduate Student, Genetics) described how they are using Linguamatics Natural Language Processing (NLP) to extract phenotype details from electronic medical records of patients with suspected genetic disorders.


Spring is a lovely time to be in Cambridge: winter is finally moving on, the spring bulbs are out and the trees are in blossom. Time for the Linguamatics Spring Text Mining Conference, which again this year was blessed with lovely sunshine. And of course, the opportunity to hear the latest about Linguamatics products and some new and fascinating use cases from our customers.

In March 2019, attendees from across pharma and healthcare came to our Spring Text Mining Conference, for hands-on workshops, a Healthcare Hackathon, networking and great presentations. The presentations covered innovations in using Natural Language Processing (NLP) to get more value from a range of unstructured text, covering electronic medical records, regulatory documents and patient social media verbatims.


Digital transformation often disrupts our systems and the way we use technology, human intelligence and processes to enhance business performance. Life science organizations are generally embracing the necessary digital revolution, but digital transformation demands data transformation, which includes developing strategies to access information buried in text.

Data-driven decision making

During the upcoming Bio-IT World Conference & Expo, Jane Reed, Linguamatics Head of Life Science Strategy, will present a talk on "Natural Language Processing: enabling data-driven rather than document-driven decision making". Natural Language Processing (NLP) allows organizations to focus on data-driven rather than document-driven decision making in a timely manner. The technology is already helping people in life sciences and healthcare, even non-programmers, to transform unstructured text into actionable structured data that can be rapidly visualized and analyzed, for decision support from bench to bedside.


Recently, I found myself in a discussion with some colleagues around whether artificial intelligence (AI) could increase commercial engagement and sales productivity; specifically, the Linguamatics flavour of AI, Natural Language Processing (NLP). My first reaction was no: our customers tend to use NLP to pull out critical information for safety assessment from internal reports, genotype-phenotype associations from literature, inclusion/exclusion criteria from clinical trial records, and many more examples that impact drug R&D.

But as the discussion progressed, I realized that as our customers drill more and more into the power of NLP to unlock value from real world data, the answer is actually yes. NLP enables data-driven, rather than document-driven decision support, by extracting key concepts and context from unstructured documents, which can then be rapidly reviewed and analysed. So, since much real world data is unstructured text, NLP can bring real productivity gains.

Challenges for pharma medical field teams

Let me give you some background, and then some examples.

Over recent years, pharma sales reps and medical science liaison staff (MSLs) have faced increasing challenges around access to Key Opinion Leaders (KOLs), physicians and prescribers, due to a more restrictive regulatory environment, new healthcare business models and evolving economic conditions. The boundaries for how pharma sales reps can interact with physicians are more limited; for example, the “lunch and learn” meetings that used to be a key tool have been significantly curtailed. In parallel, the pressures on physicians to see more patients also reduce the time they have to learn about new drugs or improved therapies.