
NLP Content Store

Answer the most crucial questions from bench to bedside

The NLP content store contains more than 350m biomedical documents.


Benefits   Why use the content store?

350m+ biomedical documents
Ready to text mine to get the insights you need
6m+ biomedical terms
Documents enhanced to include matches against ontologies
24/7 access to content
Accessible on the cloud, all content is updated weekly

Our solution   High value content – ready to text mine

The Linguamatics content store provides access to the largest set of ready-to-text-mine life science, biomedical and healthcare documents.

Answer key questions such as: 

  • What targets are involved in lung cancer?  
  • What companies are patenting a particular technology? 
  • What are the safety risks of my drug? 
  • How can I find the best site for my clinical trials? 

What content is available?

Our store is constantly being updated to add valuable content ready to mine. We can also add custom content if you have the required licenses (e.g. Embase, Springer, Copyright Clearance Center). 

Access to Linguamatics for text mining enables researchers to assess clinical trial inclusion/exclusion criteria for patient selection, trial site evaluation and study design as well as to discover competitive intelligence around companies, diseases, targets and novel drugs. 

FDA Adverse Events Reporting System (FAERS) 

As well as the standard ontologies, the FAERS index includes its own domain-specific ontologies containing classes which can be used to filter or display FAERS documents by the contents of their structured fields (sections within the documents). 

 FDA Drug Labels 

FDA Drug Labels provide a rich source of detailed intelligence on marketed drug products, including mechanism of action, pharmacology, safety/toxicity data, adverse events, contra-indications, and information on preclinical and clinical study outcomes. 

Gene Expression Omnibus (GEO) 

The Content Store GEO index contains enriched versions of each series, with ontology mapping that lets you search for genes, organisms, chemicals, numerical information and more via synonyms and common names.




PubMed

PubMed is an excellent source of biomedical research knowledge, spanning decades of published articles from academic journals in biochemistry, medicine, nursing, pharmacy, dentistry, veterinary medicine, and health care.

NIH Grants 

Access to NIH grants can facilitate the development of new collaborations and provide information on the most recent research challenges, through the rapid discovery and recommendation of researchers, key opinion leaders, current expertise, and resources.


OMIM

Curated at Johns Hopkins University, OMIM has data on over 12,000 genes and 5,000 phenotypes, and provides a powerful resource for mining genotype-phenotype relationships for target identification, personalized medicine and pharmacogenomics. Use cases for OMIM data include early discovery projects, to search for novel mechanisms and protein targets for disease areas, and clinical projects, to look at patient stratification or diagnostic gene variant annotations.

Patent - Abstracts 

As well as the standard ontologies, the Patents index includes its own domain-specific ontologies providing concepts specific to patent classification using Cooperative Patent Classification (CPC). 


Patent - Full Text 

The Patent Solution allows users to generate powerful and bespoke queries for patent search and analysis, for patent landscapes, white space analyses, freedom-to-operate searches, research methodologies, competitive intelligence and state-of-the-art reviews for confident decision making. 

PubMed Central Open Subset 

PubMed Central provides a valuable source of biomedical research knowledge; in particular, access to the full-text papers can facilitate extraction of specific methods, assays, or details of healthcare costs, patient outcomes, and other in-depth information. 

Preprints - bioRxiv 

bioRxiv is a free online archive and distribution service for unpublished preprints in the life sciences. By posting preprints on bioRxiv, authors are able to make their findings immediately available to the scientific community and receive feedback on draft manuscripts before they are submitted to journals. By posting on bioRxiv, authors explicitly consent to text mining of their work.

Preprints - medRxiv 

medRxiv is a free online archive and distribution service for complete but unpublished preprints in medical, clinical and related health sciences. medRxiv provides free and unrestricted access to all the articles posted on the server for both human readers and machine analysis.

Springer Nature 

The partnership with Springer Nature offers full-text article access to 600 Springer Nature life sciences journals from 1997 to 2020. This includes 76 Nature-branded journals, with all content being updated as new articles are published.


Enriched with proprietary and standard ontologies

All Linguamatics content sources are indexed with the Linguamatics standard set of domain-specific ontologies, for enriched semantic searches. Find details of all our ontologies below.

Biomedical Terminologies 

Linguamatics biomedical terminologies enable identification, extraction and normalization of over a million concepts, covering a wide variety of life science domains: diseases, genes, proteins, biomarkers, gene variants & mutations, phenotypes, drugs, adverse events, biological processes, organs, tissues and cells. 

Healthcare Terminologies 

Healthcare terminologies are integrated into the Linguamatics platform, covering key medical domains and categories. These are recognized using a combination of standard ontologies, pattern-based approaches and linguistic rules, so that the context around any patient variable can be taken into account (e.g. a family history of disease). They are often used alongside the biomedical terminologies to maximize the amount of information that can be extracted from medical records.

Healthcare terminologies are valuable for identifying key patient data from a variety of medical records, including patient problem lists, disease history, vital signs (blood pressure, heart rate, pulse, respiratory rate, temperature) and demographics (gender and age). Lifestyle factors such as smoking, drug use, alcohol consumption, exercise, diet and sexual activity can also be analyzed.

Chemical Entities 

Chemical entities can be found using ChEBI, MeSH and the NCI Thesaurus. In addition, the Linguamatics ChemAxon add-on identifies known and novel chemical structures within documents: by name, structure, substructure or similarity. 

Drug Terminologies

Linguamatics provides a pattern ontology that enables the identification and extraction of many different pharmaceutical company chemical identifiers (such as LY-170053, SQ 34676 and ICI 204,219).
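
To give a feel for what pattern-based identifier matching involves, here is a minimal Python sketch. It is not the Linguamatics pattern ontology: the regular expression, the `COMPANY_CODE` name and the `find_company_codes` helper are all assumptions made for illustration, and a real pattern set would need to cover many more identifier formats.

```python
import re

# Hypothetical pattern (an assumption, not the Linguamatics ontology):
# a 2-4 letter uppercase company prefix, an optional hyphen or space,
# then digits that may include one comma-separated group such as "204,219".
COMPANY_CODE = re.compile(r"\b([A-Z]{2,4})[- ]?(\d{2,6}(?:,\d{3})?)\b")


def find_company_codes(text: str) -> list[tuple[str, str]]:
    """Return (prefix, number) pairs for identifier-like strings in text."""
    return [(m.group(1), m.group(2)) for m in COMPANY_CODE.finditer(text)]


sample = "Compounds LY-170053, SQ 34676 and ICI 204,219 were compared."
for prefix, number in find_company_codes(sample):
    print(prefix, number)
```

Splitting the prefix from the number makes it easy to group hits by originating company, which is one plausible reason to treat these identifiers as a pattern rather than a fixed dictionary.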

Numerical Data 

Linguamatics provides pattern ontologies that identify numerical data, such as times, dates, numerics, and units of measurement. These allow for the identification of concepts that can be expressed in many ways, extend search by annotating novel textual descriptions of key concepts or concept types, and normalize results to greatly simplify downstream analysis.
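
The idea of normalizing numerical results can be illustrated with a small Python sketch. This is only an assumption-laden example, not how Linguamatics implements it: the `TO_MG` conversion table, the `DOSE` pattern and the `normalize_doses` function are hypothetical names, and only a handful of mass units are handled.

```python
import re

# Hypothetical unit table (an assumption): factor that converts each unit to milligrams.
TO_MG = {"g": 1000.0, "mg": 1.0, "mcg": 0.001, "ug": 0.001, "µg": 0.001}

# A number, optional whitespace, then one of the known units as a whole word.
DOSE = re.compile(r"(\d+(?:\.\d+)?)\s*(mcg|mg|ug|µg|g)\b")


def normalize_doses(text: str) -> list[float]:
    """Return every mass mention found in text, converted to milligrams."""
    return [float(value) * TO_MG[unit] for value, unit in DOSE.findall(text)]


print(normalize_doses("Patients received 0.5 g, then 250 mg, then 500 mcg."))
```

Once every mention is expressed in the same unit, values written in different ways can be sorted, filtered or compared directly in downstream analysis.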

Organizations and People 

Information on organizations can be extracted and categorized by sector, type and geographic location. Searching by sector allows named pharmaceutical companies, universities or government agencies to be extracted. Organization types are also available, using linguistic rules and patterns to automatically detect whether an entity is a corporation, division, hospital or institute. Organizations can also be identified by geographical location (region, country, state or city). In addition, pattern ontologies allow for the identification of telephone numbers, names of people, and email addresses. 

Bespoke Vocabularies 

Linguamatics supports bespoke or custom vocabularies. These can be imported from academic or commercial sources. In-house vocabularies can also be employed, for example: a dictionary of employees from an organizational chart, or a controlled vocabulary for an internal drug development project. 
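
As a rough illustration of how an in-house controlled vocabulary can drive term matching, the Python sketch below maps synonyms to a preferred term. Everything in it is hypothetical: the `VOCAB` entries, the project codes and the `tag_terms` function are invented for the example and do not reflect the Linguamatics import format or API.

```python
import re

# Hypothetical in-house vocabulary (an assumption): synonym -> preferred term.
VOCAB = {
    "project falcon": "PRJ-001",
    "falcon": "PRJ-001",
    "compound x": "PRJ-002",
}


def tag_terms(text: str, vocab: dict[str, str]) -> list[tuple[str, str]]:
    """Return (matched text, preferred term) pairs found in text."""
    # Longest synonyms first, so "project falcon" wins over the shorter "falcon".
    alternatives = sorted(vocab, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(s) for s in alternatives) + r")\b",
        re.IGNORECASE,
    )
    return [(m.group(1), vocab[m.group(1).lower()]) for m in pattern.finditer(text)]


print(tag_terms("Project Falcon outperformed compound X.", VOCAB))
```

Mapping every synonym to one preferred term means that results mentioning the same project under different names are counted together.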

Source-specific Dictionaries 

Linguamatics incorporates data from the sources in the Content Store to provide source-specific dictionaries. These include Patent classification codes, listings of product names in FDA Drug Labels and specific FAERS terms.