Content Store

Linguamatics provides access to a range of biomedical content, available through our cloud-based NLP products or, for those with on-premise installations, via our connected data technology. Each source is refreshed weekly by the team.

  1. ClinicalTrials.gov
  2. FDA Adverse Events Reporting System (FAERS)
  3. FDA Drug Labels
  4. Gene Expression Omnibus (GEO)
  5. MEDLINE
  6. NIH Grants
  7. OMIM
  8. Patent - Abstracts
  9. Patent - Full Text
  10. PubMed Central Open Subset
  11. Preprints - bioRxiv
  12. Preprints - medRxiv
  13. Springer Nature

Linguamatics provides access to a range of content options, all accessible via I2E OnDemand or via our Connected Data Technology for those with an Enterprise installation. All documents are enhanced by including matches against a broad range of ontologies and the generation of document sections, such as title, abstract and author, in order to improve accuracy of extraction. Each source is refreshed weekly by Linguamatics, providing you with up-to-date information and access to the latest documents.

In addition to the following sources hosted by Linguamatics, we partner with Copyright Clearance Center to provide access to full-text journal articles via RightFind for XML Mining. Articles obtained in this way can then be accessed via I2E OnDemand.

ClinicalTrials.gov

ClinicalTrials.gov is a registry of federally and privately supported clinical trials conducted in the United States and around the world. It contains information about medical studies in human volunteers. Most records describe clinical trials (also called interventional studies), but the registry also includes records describing observational studies and programs providing access to investigational drugs outside of clinical trials (expanded access). Studies listed in the database are conducted in all 50 US states as well as over 200 other countries.

As well as the standard ontologies, the Linguamatics ClinicalTrials.gov index includes its own domain-specific ontology providing concepts specific to clinical trials, e.g. Recruitment Status and Study Phase/Type/Design. Access to Linguamatics ClinicalTrials.gov for text mining enables researchers to assess clinical trial inclusion/exclusion criteria for patient selection, trial site evaluation and study design, as well as to discover competitive intelligence around companies, diseases, targets and novel drugs.

Learn how I2E users can benefit from access to ClinicalTrials.gov

Case study: "Clinical Trials at AstraZeneca"

FDA Adverse Events Reporting System (FAERS)

FAERS is a rich source of ready-to-use safety surveillance data from which information about safety concerns reported by users and clinicians can be extracted.

It is typically used to monitor and discover safety issues of a marketed drug product. The data is managed and maintained by FDA and made available for public download. Linguamatics processes the XML data to make it searchable as an I2E index.

As well as the standard ontologies, the FAERS index includes its own domain-specific ontologies containing classes which can be used to filter or display FAERS documents by the contents of their structured fields (sections within the documents).

FDA Drug Labels

The data is sourced from the DailyMed site hosted by the US National Library of Medicine. It contains high-quality information about marketed drugs, including up-to-date and accurate FDA drug labels (package inserts) that describe the composition, form, packaging, and other properties of the drug products in detail. As well as the standard ontologies, the FDA Drug Labels index includes its own domain-specific ontologies providing concepts specific to drug types and document classifications as defined by FDA.

FDA Drug Labels provide a rich source of detailed intelligence on marketed drug products, including mechanism of action, pharmacology, safety/toxicity data, adverse events, contraindications, and information on preclinical and clinical study outcomes.

Gene Expression Omnibus (GEO)

The Gene Expression Omnibus (GEO) is a public repository that archives and freely distributes comprehensive sets of microarray, next-generation sequencing, and other forms of high-throughput functional genomic data submitted by the scientific community. GEO deposit procedures enable and encourage submitters to supply MIAME- and MINSEQE-compliant data. Use cases for GEO data include identifying experimental results related to disease models, investigating the use of chemicals or chemical combinations in microarrays, and finding published literature related to your own work in order to prioritize particular targets.

The Content Store GEO index contains enriched versions of each series, with ontology mapping that lets you search for genes, organisms, chemicals, numerical information and more via synonyms and common names. Sections within each semi-structured GEO accession, as well as relationships between series, samples and platforms, are preserved to allow for fielded search or to link information across different entries. Information extracted from the GEO index is structured and normalized for easy review and downstream processing, and searches can be automated to ensure that the latest updates are regularly available to you.

MEDLINE

MEDLINE® contains journal citations and abstracts for biomedical literature from around the world.

MEDLINE is the U.S. National Library of Medicine's (NLM) premier bibliographic database, containing over 29 million references to journal articles in life sciences with a concentration on biomedicine. Each year there is a new release of the base distribution, usually in mid-December; then, through the rest of the year, approximately 2,000-4,000 references are added each day.

MEDLINE is an excellent source of biomedical research knowledge, covering decades of published articles from academic journals in biochemistry, medicine, nursing, pharmacy, dentistry, veterinary medicine, and health care.

NIH Grants

NIH Grants provides data on research projects funded by the National Institutes of Health (NIH), the Centers for Disease Control and Prevention (CDC), the Food and Drug Administration (FDA), and the Department of Veterans Affairs (VA), together with their abstracts and the publications and patents citing support from these projects. The data is separated into four major categories of files: Projects, Project Abstracts, Publications, and Patents. The data is sourced from the ExPORTER site (owned by NIH).

As well as the standard ontologies, NIH Grants includes its own domain-specific ontology providing concepts classifying the types of grants awarded. Access to NIH Grants can facilitate the development of new collaborations, and provide information on the most recent research challenges, through the rapid discovery and recommendation of researchers, key opinion leaders, current expertise, and resources.

OMIM

Online Mendelian Inheritance in Man® (OMIM) is a comprehensive catalogue of human genes and genetic conditions and traits, with particular focus on the molecular relationship between genetic variation and phenotypic expression.

Curated at Johns Hopkins University, OMIM has data on over 12,000 genes and 5,000 phenotypes, and provides a powerful resource for mining genotype-phenotype relationships for target identification, personalized medicine and pharmacogenomics. Use cases for OMIM data include early discovery projects, to search for novel mechanisms and protein targets for disease areas, and clinical projects, to look at patient stratification or diagnostic gene variant annotations.

Blog: "Synergy of OMIM and PubMed in Understanding Gene-Disease Associations for Synapse Proteins"

Patent - Abstracts

This includes a complete set of patent abstracts (and additional citation information) from all patent agencies. The data is provided with a uniform structure to allow consistent searching across all sources regardless of their origin. The indexes are organized as a complete set or subdivided by era (last year, last 5 years, last 20 years, etc.).

As well as the standard ontologies, the Patents index includes its own domain-specific ontologies providing concepts specific to patent classification using Cooperative Patent Classification (CPC).

Patent - Full Text

This includes a complete set of full-text patents from USPTO, EPO and WIPO. The documents are provided with a uniform structure to allow consistent searching across all sources regardless of their origin. The indexes are organized as a complete set, by individual authority, or subdivided by era (last year, last 5 years, last 20 years, etc.).

As well as the standard ontologies, the Patents index includes its own domain-specific ontologies providing concepts specific to patent classification using Cooperative Patent Classification (CPC). The I2E Patent Solution allows users to generate powerful and bespoke queries for patent search and analysis, including patent landscapes, white space analyses, freedom-to-operate searches, research methodologies, competitive intelligence and state-of-the-art reviews for confident decision making.

PubMed Central Open Subset

This is part of the total collection of articles in PubMed Central (PMC), which is an archive of biomedical and life sciences journal literature at the U.S. National Institutes of Health’s National Library of Medicine (NIH NLM).

PMC Open Subset is an electronic archive of full-text journal articles made available under Creative Commons licenses, which allow more liberal redistribution and re-use of the content than traditional copyrighted work. As with MEDLINE abstracts, PubMed Central provides a valuable source of biomedical research knowledge; in particular, access to the full-text papers can facilitate extraction of specific methods, assays, or details of healthcare costs, patient outcomes, and other in-depth information.

Preprints - bioRxiv

bioRxiv is a free online archive and distribution service for unpublished preprints in the life sciences. It is operated by Cold Spring Harbor Laboratory, a not-for-profit research and educational institution. By posting preprints on bioRxiv, authors are able to make their findings immediately available to the scientific community and receive feedback on draft manuscripts before they are submitted to journals. By posting on bioRxiv, authors explicitly consent to text mining of their work. The preprints are organized as a complete set or subdivided by terms of use.

Preprints - medRxiv

medRxiv is a free online archive and distribution service for complete but unpublished preprints in medical, clinical and related health sciences. It is also operated by Cold Spring Harbor Laboratory. medRxiv provides free and unrestricted access to all the articles posted on the server for both human readers and machine analysis. The preprints are organized as a complete set or subdivided by terms of use.

Springer Nature

Springer Nature is one of the world’s largest and most influential publishers of scientific and technical books, journals, databases, and open research. Subscriptions to Springer’s online library and databases support technical learning and research at companies of all sizes, from ICT to the life sciences.

Linguamatics now offers Springer Nature’s large collection of influential journals directly through the Linguamatics NLP platform. The partnership with Springer Nature provides full-text article access to 600 Springer Nature life sciences journals from 1997 to 2020, including 76 Nature-branded journals, with all content being updated as new articles are published. We recognize that every organization is different, so if you have specific content needs, we can tailor the content to them.
