Life sciences researchers who employ text mining and natural language processing (NLP) techniques to extract discrete facts and insights from scientific articles have generally relied on MEDLINE abstracts to define a corpus.

But there is increasing interest in mining full-text articles, with researchers experimenting on corpora sourced from Open Access repositories such as PubMed Central (PMC). Organizations are eager to take advantage of the unique benefits full text provides, and rightly so.

Full-text content provides insights that researchers cannot access through abstracts alone. Here are three central benefits of mining a full-text corpus:

Volume. Full-text articles include more named entities and relationships between those entities than their corresponding abstracts – this is intuitively obvious when we consider the length of an abstract versus its full-text article. A study published in the Journal of Biomedical Informatics makes this point quantitatively: Only 7.84% of the scientific claims made in full-text articles are found in their abstracts.[i]
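The volume difference is easy to demonstrate with a simple dictionary-based entity counter. This is a minimal sketch, not a production NLP pipeline: the entity dictionary, synonyms, and sample texts below are invented for illustration.

```python
# Minimal sketch: counting entity mentions in an abstract vs. full text.
# The entity dictionary and sample texts are hypothetical, invented for
# this example; real pipelines use curated ontologies and NER models.
import re

ENTITY_DICT = {
    "HER2": ["HER2", "ERBB2"],
    "trastuzumab": ["trastuzumab", "Herceptin"],
    "breast cancer": ["breast cancer", "breast carcinoma"],
}

def count_entity_mentions(text):
    """Count mentions of each dictionary entity, matching any synonym."""
    counts = {}
    for entity, synonyms in ENTITY_DICT.items():
        pattern = "|".join(re.escape(s) for s in synonyms)
        counts[entity] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts

abstract = "Trastuzumab improves survival in HER2-positive breast cancer."
full_text = (
    "Trastuzumab (Herceptin) targets HER2 (ERBB2). In breast cancer, "
    "HER2 amplification predicts response; trastuzumab plus chemotherapy "
    "improved outcomes in HER2-positive breast carcinoma."
)

print(count_entity_mentions(abstract))   # fewer mentions per entity
print(count_entity_mentions(full_text))  # more mentions of the same entities
```

Even on this toy example, the full-text passage yields several times as many entity mentions as the abstract, mirroring the pattern the study quantifies.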


Wilmington, DE – The Pistoia Alliance, an organization dedicated to improving global life sciences R&D, has seen its membership increase with a number of new members including both large multinationals and start-ups.

The new members include Accenture, Linguamatics, Novaseek, Repositive, Agrimetrics and Daniel Taylor. Existing members upgrading to the new Startup membership category this quarter include KNIME, Scitegrity, Databiology, The Hyve, Binocular Vision, BioVariance, and Promeditec. This takes the membership of the Pistoia Alliance to over 80 globally, including many of the world’s biggest pharmaceutical companies, many of the most innovative start-ups, and companies and organizations that support the life sciences sector.

Dr. Steve Arlington, Pistoia Alliance President said: “I am delighted to see that the Pistoia Alliance continues to show strong growth across all segments in life sciences, and we continue to attract a broad range of members. At the same time, we are also changing how we operate. Our challenge is to promote and encourage pre-competitive collaboration between our members, to benefit our members and ultimately accelerate the delivery of new drugs, devices and services to enhance performance within the sector. The Pistoia Alliance is well placed to help life sciences tackle many of its challenges and through our new strategy and the continued support of our members we will continue to support the global life sciences industry.”


Until recently, the use of natural language processing (NLP) in healthcare has been primarily limited to research efforts and population health within academic medical centers. However, with the proliferation of unstructured data from electronic medical records, providers are now seeking to harness the potential of their data and considering a variety of use cases for NLP technology.[1] That’s the conclusion of a recent KLAS report entitled “Natural Language Processing: Glimpses into the Future of Unstructured Data Mining.”

The report includes insights from 58 provider organizations and examines the various ways providers are currently leveraging NLP technology, as well as some of the use cases poised for wider adoption. Coding and documentation applications represent the broadest use of NLP engines. But it is clear providers have a growing interest in NLP solutions that advance their population health initiatives. An increasingly popular use case involves applications that use NLP to mine unstructured data within patient populations and include predictive analytics to identify at-risk patient populations.

A few of the major findings from KLAS’s report are summarized below.

How is NLP being used today?


We are always enthused to read about new ways to utilize text mining in the drug discovery and development process, and very much enjoyed the recent paper by Heinemann et al., “Reflection of successful anticancer drug development processes in the literature”. In this study, the researchers develop tools that allow the prediction of the approval or failure of a targeted cancer drug, using models based on information mined from MEDLINE abstracts, along with a slew of other quantitative metadata (e.g. MeSH headings, author counts, fraction of authors with industry affiliation, and more). 

I2E, Linguamatics’ text mining platform, enabled the researchers to systematically identify all MEDLINE abstracts containing both the protein target and the specific disease indication of a known set of successfully approved or failed cancer therapeutics; for example, abstracts containing both Her2 and breast cancer, or c-Kit and gastrointestinal stromal tumor (GIST). I2E enables the use of large vocabularies or ontologies of genes and diseases to extract key information, and the researchers used I2E for the rapid retrieval of publications containing any one of the many synonyms of a protein target or indication.
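The core idea of synonym-expanded co-occurrence retrieval can be sketched outside I2E as a simple Boolean search over abstracts. This is a hypothetical illustration, not I2E's actual query language; the vocabularies and abstracts below are invented.

```python
# Hypothetical sketch of synonym-expanded co-occurrence retrieval.
# Not I2E's actual API; vocabularies and abstracts are invented.
TARGET_SYNONYMS = {"her2": {"her2", "erbb2", "neu"}}
INDICATION_SYNONYMS = {"breast cancer": {"breast cancer", "breast carcinoma"}}

def mentions_any(text, synonyms):
    """True if the text contains any synonym (case-insensitive substring)."""
    text = text.lower()
    return any(s in text for s in synonyms)

def find_cooccurrences(abstracts, target, indication):
    """Return IDs of abstracts mentioning any synonym of both concepts."""
    hits = []
    for pmid, text in abstracts.items():
        if (mentions_any(text, TARGET_SYNONYMS[target])
                and mentions_any(text, INDICATION_SYNONYMS[indication])):
            hits.append(pmid)
    return hits

abstracts = {
    "PMID1": "ERBB2 amplification is frequent in breast carcinoma.",
    "PMID2": "KIT mutations drive gastrointestinal stromal tumors.",
    "PMID3": "HER2 status guides therapy in breast cancer.",
}
print(find_cooccurrences(abstracts, "her2", "breast cancer"))
```

Note how the first abstract is retrieved even though it uses neither "Her2" nor "breast cancer" literally; that is the value of querying with full synonym sets rather than single strings.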

The researchers found that the set of approved target-indication pairs showed a significantly higher publication count, beginning 9 years before FDA approval, than the eventually failing pairs.

Taking the study further, they applied machine learning classifiers and found that the extracted data features could be used to predict success or failure of target-indication pairs, and hence, approved or failed drugs. They conclude:
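As an illustration only (not the authors' actual model or features), a classifier of this kind can be sketched with a single toy feature, a scaled pre-approval publication count, and a hand-rolled logistic regression. All data below are invented; the study used richer literature-derived features and standard ML tooling.

```python
# Illustrative sketch: logistic regression on one invented feature
# (scaled pre-approval publication count) predicting approval (1) vs.
# failure (0). Data, scaling, and hyperparameters are all hypothetical.
import math

def train_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit w, b for p(success) = sigmoid(w*x + b) by gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            gw += (p - y) * x   # gradient of log-loss w.r.t. w
            gb += (p - y)       # gradient of log-loss w.r.t. b
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

# Toy training set: (scaled publication count, 1 = approved, 0 = failed)
xs = [0.2, 0.4, 0.5, 1.2, 1.5, 2.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(xs, ys)

def predict(x):
    """Predicted probability of approval for a publication-count feature."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))
```

After training, pairs with higher publication counts receive higher predicted approval probabilities, which is the qualitative pattern the study reports; the real models combine many such features.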


During his January 2015 State of the Union speech, President Obama announced details of his administration’s Precision Medicine Initiative, which promises to accelerate the development of tools and therapies that are customized to individual patients. Precision medicine focuses on disease treatment and prevention and considers the variability in genes, environment, and lifestyle between individual patients.

Precision medicine takes into account healthcare’s relatively minor role in impacting a patient’s overall health and well-being, compared to the larger roles of genetics, health behaviors, and social and environmental factors. The precision medicine approach thus requires that providers have access to a wealth of patient-specific data. Thanks to advancements in genetic testing and new technologies, such as patient portals and remote monitoring devices, a wide variety of patient data is now readily available. Unfortunately, clinicians may have difficulty extracting data that is clinically relevant because much of the information is stored in an unstructured format.

Consider how a physician would glean information from a paper medical chart prior to EMRs. To understand a patient’s complete health status, the doctor would search through pages and pages of notes, obviously a time-consuming and error-prone task.