While Natural Language Processing (NLP)-based text mining has become a widely used technology within pharma, biotech and healthcare organizations, some still view NLP as esoteric, a tool only for experts. At AbbVie, however, NLP has been democratized: researchers are given access to it through broad-access web portals.

One project illustrating AbbVie's approach to broadening access to NLP was presented earlier this year at a Linguamatics NLP seminar. Abhik Seal, from the data science team at AbbVie, described an innovative web portal developed to provide more effective search for pharmacokinetic and pharmacodynamic (PK/PD) parameters for pharmacometricians and chemists within the Clinical Pharmacology and Pharmacometrics (CPPM) group. Manual search of scientific abstracts and full-text papers is typically slow and laborious, particularly when extracting key PK/PD numeric values and units, such as drug concentrations, exposures, efficacies and dosages.

The platform Seal’s team developed is known as PharMine (Figure 1). The team implemented a workflow (Figure 2) that takes in Medline abstracts and uses a suite of NLP queries to extract key information, including PK/PD parameters, values and units such as drug concentrations, exposures and dosages; a minimal sketch of this kind of extraction is shown below.
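The specific PharMine queries were not shared in the talk. As a rough illustration of the underlying task, the following minimal Python sketch pulls parameter-value-unit mentions out of abstract text with a regular expression; the parameter and unit lists here are assumptions for the example, not AbbVie's vocabulary.

    import re

    # Illustrative only: these parameter and unit lists are assumptions
    # for the example, not the PharMine vocabulary or AbbVie's queries.
    PKPD_PATTERN = re.compile(
        r"(?P<param>Cmax|Tmax|AUC|half-life|clearance|IC50|EC50)"
        r"[\s,]*(?:was|is|of|=|:)\s*"   # linking word between parameter and value
        r"(?P<value>\d+(?:\.\d+)?)\s*"
        r"(?P<unit>ng/mL|µg/mL|mg/kg|L/h|mg|h)?\b",
        re.IGNORECASE,
    )

    def extract_pkpd(abstract: str) -> list[dict]:
        """Return parameter/value/unit mentions found in one abstract."""
        return [m.groupdict() for m in PKPD_PATTERN.finditer(abstract)]

    text = ("After a single oral dose, the mean Cmax was 45.2 ng/mL "
            "and the terminal half-life was 12 h.")
    print(extract_pkpd(text))
    # [{'param': 'Cmax', 'value': '45.2', 'unit': 'ng/mL'},
    #  {'param': 'half-life', 'value': '12', 'unit': 'h'}]

A production system handles far more variation than a single pattern can (ranges, confidence intervals, unit conversions), which is where a full NLP platform earns its keep; the sketch only shows the shape of the problem.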


Bone deformities, hearing loss, frequent respiratory infections, cognitive impairment, and chronic heart and liver disorders are among the symptoms suffered by infants with Hunter syndrome (also known as mucopolysaccharidosis II). This blog post follows our previous research, carried out in collaboration with Shire, on associations between genotype and phenotype in very rare diseases.

Shire, now part of Takeda, provides an enzyme replacement therapy for Hunter syndrome. However, to ameliorate the neurocognitive effects, the enzyme replacement molecule needs to be delivered to the central nervous system (CNS) via an innovative implant device, an invasive procedure.


In her article in Rx Data, Jane Reed, Director of Life Science at Linguamatics, discusses the impact of advanced data technologies (artificial intelligence and machine learning) on innovation in drug discovery, development and delivery.

We are now in the fourth industrial revolution (4IR), known to some as the Big Data Revolution. Advances in connectivity and communication bring improved data access and the new-found potential to analyze huge volumes of data. The ability to access these large volumes of varied data and to connect, integrate, query and analyze them is enabling fundamental changes in how we envision drug discovery and delivery in the clinic. The pace of these changes is also remarkable; Jane notes the fast evolution of genome-based projects, from the first human chromosome sequenced in 1999 and the draft human genome published in 2001 to the more recent UK 100,000 Genomes Project.

According to Jane, the key components of these innovations are data integration and data analysis. To keep pace, pharma companies now need to join up genomic data with clinical information and knowledge about particular diseases.


NLP for FAIRification of unstructured data

Data is the lifeblood of all research, and pharmaceutical research is no exception. Clean, accurate, integrated data sets promise to provide the substrate for artificial intelligence (AI) and machine learning (ML) models that can assist with better drug discovery and development. Using data in the most effective and efficient way is critical, and improving scientific data management and stewardship through the FAIR (findable, accessible, interoperable, reusable) principles will improve pharma efficiency and effectiveness.

What is FAIR and how can NLP contribute?

The FAIR principles were first proposed in 2016, and the initial paper triggered not just discussion (see the recent Pistoia Alliance review paper) but, in many organizations, action.

We are seeing applications of natural language processing (NLP) in FAIRification of unstructured data. Around 80% of information needed for pharma decision-making is in unstructured formats, from scientific literature to safety reports, clinical trial protocols to regulatory dossiers and more.

NLP can contribute to the FAIRification of these data in a number of ways. NLP enables effective use of ontologies, a key component of data interoperability. Ensuring that life science entities are referred to consistently across different data sources allows those data to be integrated and made accessible for machine operations. Using a combination of strategies (e.g. ontologies, linguistic rules), NLP can transform unstructured data into formal representations, whether individual concept identifiers or relationships expressed as RDF triples.
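As a concrete illustration of that last point, here is a minimal Python sketch, assuming the rdflib library and a toy synonym dictionary, that normalizes drug-name mentions to ChEBI identifiers and records them as RDF triples. It is a sketch of the general idea, not a description of any particular Linguamatics product.

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDFS

    # Toy synonym table standing in for a full ontology such as ChEBI;
    # the identifiers are real ChEBI IDs, but a production pipeline
    # would load the whole vocabulary rather than hard-code entries.
    OBO = Namespace("http://purl.obolibrary.org/obo/")
    SYNONYMS = {
        "aspirin": "CHEBI_15365",
        "acetylsalicylic acid": "CHEBI_15365",
        "acetaminophen": "CHEBI_46195",
        "paracetamol": "CHEBI_46195",
    }

    g = Graph()
    for mention in ["Aspirin", "acetylsalicylic acid", "paracetamol"]:
        ident = SYNONYMS.get(mention.lower())
        if ident:
            # Attach each surface form to its canonical concept URI, so
            # the two aspirin mentions resolve to one shared identifier.
            g.add((OBO[ident], RDFS.label, Literal(mention)))

    print(g.serialize(format="turtle"))

The payoff is interoperability: once both "Aspirin" and "acetylsalicylic acid" resolve to the same URI, downstream tools can integrate and query the data without caring which surface form appeared in the source text.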


Precision medicine focuses on disease treatment and prevention, both at the clinical level (in healthcare organizations) and within drug discovery and development (in pharma companies). Treatments are developed and delivered taking into account the variability in genes, environment and lifestyle between individual patients.

Within the clinical arena, in order to understand the best treatment pathway for a particular patient or group of patients, it is important to be able to access and analyze detailed information from patients' medical records, and ideally broader information beyond their medical history.

A great example of precision medicine within the clinical arena was presented at the Linguamatics seminar in Chicago in March 2019. At the University of Iowa, scientists at the Stead Family Children’s Hospital are working on a precision medicine research project. Alyssa Hahn (Graduate Student, Genetics) described how they are using Linguamatics Natural Language Processing (NLP) to extract phenotype details from electronic medical records of patients with suspected genetic disorders.
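For readers unfamiliar with the task, the sketch below shows the general shape of phenotype extraction: matching clinical-note text against Human Phenotype Ontology (HPO) terms. It is a minimal, dictionary-based illustration in Python, not the Linguamatics NLP workflow Hahn described.

    import re

    # Tiny HPO lookup for illustration; these IDs are real HPO terms,
    # but a real system would load the full ontology with synonyms.
    HPO_TERMS = {
        "hearing loss": "HP:0000365",
        "seizure": "HP:0001250",
        "global developmental delay": "HP:0001263",
        "hepatomegaly": "HP:0002240",
    }

    def find_phenotypes(note: str) -> list[tuple[str, str]]:
        """Return (surface term, HPO id) pairs found in a clinical note."""
        hits = []
        for term, hpo_id in HPO_TERMS.items():
            # "s?" is a naive nod to plurals; real systems use full
            # morphological and synonym handling.
            if re.search(rf"\b{re.escape(term)}s?\b", note, re.IGNORECASE):
                hits.append((term, hpo_id))
        return hits

    note = ("3-year-old with global developmental delay, recurrent "
            "seizures and bilateral hearing loss; hepatomegaly noted.")
    print(find_phenotypes(note))

Mapping free-text phenotype descriptions to standard HPO identifiers is what lets clinicians compare a patient's presentation against known genetic disorders, which is exactly the kind of downstream analysis the Iowa project depends on.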