Natural Language Processing: Standing on the shoulders of giants

Many of us know the joys and sorrows of research. Weeks, months and years can pass: developing hypotheses, working in the lab or clinic, analyzing results, sometimes going back to square one, but then writing the paper and, at last, seeing the final version published and in print. The intent is that your research is shared, discussed and re-used, so that others can build on it, "standing on the shoulders of giants," as Isaac Newton famously put it.

Traditionally, getting information out of written papers for re-use has been a manual task: individuals reading, reviewing and extracting the key facts from tens or hundreds of papers by hand, in order to summarize the most up-to-date research in a field, or to understand the landscape of information around a particular research topic. Over the past few decades, Artificial Intelligence (AI) tools such as Natural Language Processing (NLP) have evolved that can hugely speed up and improve this data extraction. NLP solutions enable researchers to access information from huge volumes of scientific abstracts and literature, whether developing strategies and rules that drill deep into the literature for hidden nuggets or, more broadly, ploughing the landscape for the nuggets of desired information.

To give a couple of examples, I'll share two recently published use cases that apply the Linguamatics NLP platform across published literature, enabling researchers to benefit from years of previous research.


It is well known that the drug discovery and development process is lengthy, expensive and prone to failure. Starting from the selection of a novel target in discovery, through the multiple steps to regulatory approval, the overall probability of success is less than 1%.

One factor is that the majority of diseases are multifaceted; hence the challenge is identifying the most appropriate patient populations who will respond to specific interventions. A stratified approach has proven beneficial in a number of cancers and genetic diseases, and pharmaceutical companies have a strong interest in understanding how to find the sub-populations of patients, to ensure the most appropriate therapies are tested in clinical trials and applied in broader clinical use.

The ultimate aim of a stratified approach to medicine is to enable healthcare professionals to provide the “right treatment, for the right person, at the right dose, at the right time”; and there are many research initiatives (governmental, private, public) on-going to develop the appropriate knowledge and models.


While Natural Language Processing (NLP)-based text mining has become a widely used technology within pharma, biotech and healthcare organizations, some still view NLP use as esoteric, only for experts. At AbbVie, however, the use of NLP has been democratized for researchers with the provision of broad-access web portals.

One project illustrating their approach to broadening access to NLP was presented earlier this year at a Linguamatics NLP seminar. Abhik Seal, from the data science team at AbbVie, described an innovative web portal developed to provide a more effective search for pharmacokinetic (PK) and pharmacodynamic (PD) parameters for pharmacometricians and chemists within the Clinical Pharmacology and Pharmacometrics (CPPM) group. Manual search of scientific abstracts and full-text papers is typically slow and laborious, particularly when extracting key PK/PD numerics and units, such as drug concentrations, exposures, efficacies, dosages and more.
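To give a flavor of what this kind of extraction involves, here is a minimal rule-based sketch in Python. It is purely illustrative: the parameter names, unit list and "parameter = value unit" phrasing patterns are assumptions for the example, and the actual Linguamatics NLP queries behind PharMine are far richer than a single regular expression.

```python
import re

# Hypothetical shortlists of PK/PD parameter names and units; a real system
# would use curated terminologies and linguistic patterns, not a flat regex.
PARAMS = r"(?:AUC|Cmax|Tmax|half-life|clearance|IC50)"
UNITS = r"(?:ng/mL|mg/L|mL/min|nM|uM|hr|h)"

# Match phrasings like "Cmax was 120 ng/mL" or "AUC of 45 mg/L".
PATTERN = re.compile(
    rf"(?P<param>{PARAMS})\s*(?:of|was|=|:)?\s*"
    rf"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>{UNITS})",
    re.IGNORECASE,
)

def extract_pkpd(abstract: str):
    """Return (parameter, value, unit) triples found in an abstract."""
    return [
        (m["param"], float(m["value"]), m["unit"])
        for m in PATTERN.finditer(abstract)
    ]

text = "After a single dose, Cmax was 120 ng/mL and the half-life was 6.5 h."
print(extract_pkpd(text))
# [('Cmax', 120.0, 'ng/mL'), ('half-life', 6.5, 'h')]
```

Even this toy version shows why manual extraction is laborious: each numeric fact needs the parameter, its value and its unit captured together, across many phrasing variants, for every abstract in a corpus.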

The platform Seal’s team developed is known as PharMine (Figure 1). They have implemented a workflow (Figure 2) that takes in Medline abstracts and uses a suite of NLP queries to extract key information including:


Bone deformities, hearing loss, frequent respiratory infections, cognitive impairment and chronic heart and liver disorders are symptoms suffered by infants with Hunter syndrome (also known as Mucopolysaccharidosis II). This blog follows our previous research on associations between genotype and phenotype in very rare diseases, in collaboration with Shire. 

Shire, now part of Takeda, provides an enzyme replacement therapy for Hunter syndrome. However, in order to ameliorate the neurocognitive effects, the enzyme replacement molecule needs to be delivered to the central nervous system (CNS) via an innovative implant device, which is an invasive procedure.


In her article in Rx Data, Jane Reed, Director of Life Science at Linguamatics, discusses the impact of advanced data technologies (artificial intelligence and machine learning) on innovation in drug discovery, development and delivery.

We are now in the fourth industrial revolution (4IR), known to some as the Big Data Revolution. Advances in connectivity and communication, brought by the digital revolution, deliver improved data access and a new-found potential to analyze huge volumes of data. The ability to access these large volumes of varied data, and to connect, integrate, query and analyze them, is enabling fundamental changes in how we envision drug discovery and delivery in the clinic. The pace of these changes is also remarkable; Jane notes the fast-paced evolution of genome-based projects, from the first human chromosome sequenced in 1999 and the draft human genome published in 2001 to the more recent UK 100,000 Genomes Project.

According to Jane, the key components for these innovations include data integration and data analysis. To keep up with this pace, pharma companies now need to join up genomic data with clinical information and knowledge about particular diseases.