Text Mining Full-Text Scientific Articles: More Facts, More Types of Facts, Faster

Life sciences researchers who employ text mining and NLP (natural language processing) techniques to extract discrete facts and insights from scientific articles have generally relied on MEDLINE abstracts to define a corpus.

But there is increasing interest in mining full-text articles, with researchers experimenting on corpora sourced from Open Access repositories like PubMed Central (PMC). Organizations are eager to take advantage of the unique benefits full text provides, and rightly so.

Full-text content provides insights that researchers could not obtain from abstracts alone. Here are three central benefits of mining a full-text corpus:

Volume. Full-text articles include more named entities and relationships between those entities than their corresponding abstracts – this is intuitively obvious when we consider the length of an abstract versus its full-text article. A study published in the Journal of Biomedical Informatics makes this point quantitatively: Only 7.84% of the scientific claims made in full-text articles are found in their abstracts.[i]

Diversity. Beyond the increased volume of information present in a full-text article vs. its abstract, researchers are also likely to find more diverse types of information in the full text. This is obvious in the case of, say, experimental and tabular data that can be represented in full within the body of an article, but that must necessarily be summarized in an abstract. But even summarized facts can give a skewed picture when only abstracts are examined. According to a study published in BMC Medical Research Methodology, abstracts published in high-impact-factor medical journals underreport harm even when the main body of the article provides that information.[ii]

Timeliness. Researchers are likely to discover scientific findings sooner by using full-text articles rather than abstracts. Following initial publication of a new discovery in a particular journal, the research is often repeated and published elsewhere. But there is a significant delay between when those findings first appear in the body of a full-text article and when the same information is included in the abstract of a subsequent article. In fact, it can take one to two years for a study finding present in a full-text article to appear in the abstract of a subsequent article.[iii]
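To make the Volume point concrete, here is a minimal Python sketch of how one might estimate what share of the terms found in a full-text article also appear in its abstract. It is an illustration only: the function names, the toy vocabulary, and the sample texts are all hypothetical, and simple dictionary matching is a stand-in for real entity or claim extraction. It is not the method used in the studies cited above.

```python
import re


def find_terms(text, vocabulary):
    """Return the vocabulary terms that occur in text (case-insensitive, whole words)."""
    found = set()
    for term in vocabulary:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.add(term)
    return found


def abstract_coverage(abstract, full_text, vocabulary):
    """Fraction of the terms found in the full text that also appear in the abstract."""
    in_full = find_terms(full_text, vocabulary)
    if not in_full:
        return 0.0  # nothing to cover, so avoid dividing by zero
    in_abstract = find_terms(abstract, vocabulary)
    return len(in_abstract & in_full) / len(in_full)


# Toy, made-up data for illustration only.
vocabulary = {"BRCA1", "TP53", "EGFR", "apoptosis"}
abstract = "We study BRCA1 in breast cancer."
full_text = abstract + " TP53 and EGFR were also measured. Apoptosis increased."

coverage = abstract_coverage(abstract, full_text, vocabulary)
print(f"Share of full-text terms also in the abstract: {coverage:.0%}")
```

Run over a real corpus with a proper extraction pipeline, a measure of this kind would show how much an abstract-only corpus leaves out.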

The takeaway for researchers: For projects where you need comprehensiveness – access to more facts and more types of facts – and quicker paths to insights, text mining full-text articles rather than abstracts is the clear choice.

Find out more about easy access to full-text articles for text mining in the Copyright Clearance Center and Linguamatics Solution Overview.


[i] Catherine Blake. “Beyond genes, proteins, and abstracts: Identifying scientific claims from full-text biomedical articles.” Journal of Biomedical Informatics Volume 43, Issue 2, April 2010, Pages 173–189

[ii] Enrique Bernal-Delgado and Elliot S Fisher. “Abstracts in high profile journals often fail to report harm.” BMC Medical Research Methodology (2008); 8:14

[iii] Elsevier (2015) Harnessing the Power of Content - Extracting value from scientific literature: the power of mining full-text articles for pathway analysis. Available at