Mining Nuggets from Safety Silos

May 26 2015

Better access to the high value information in legacy safety reports has been, for many folk in pharma safety assessment, a “holy grail”. Locked away in these historical data are answers to questions such as:  Has this particular organ toxicity been seen before? In what species, and with what chemistry? Could new biomarker or imaging studies predict the toxicity earlier? What compounds could be leveraged to help build capabilities?


I2E enables extraction and integration of historical preclinical safety information, crucial to optimizing investment in R&D, alleviating concerns where preclinical observations may not be human-relevant, and reducing late stage failures.

Coming as I do from a decade of working in data informatics for safety/tox prediction, I was excited by one of the talks at the recent Linguamatics Spring User conference. Wendy Cornell (ex-Merck) presented on an ambitious project to use Linguamatics text mining platform, I2E, in a workflow to extract high value information from safety assessment reports stored in Documentum.

Access to historic safety data is a potential advantage that will be helped with the use of standards in electronic data submission for regulatory studies (e.g. CDISC’s SEND, the standard for exchange of non-clinical data).

Standardizing the formats and vocabularies for key domains in safety studies will enable these data to be fed into searchable databases; however these structured data miss the intellectual content added by the pathologists and toxicologists, whose conclusions are essential for understanding whether evidence of a particular tox finding (e.g. hyalinosis, single cell necrosis, blood enzyme elevations) signals a potential serious problem in humans or is specific to the animal model.

For these key conclusions, access to the full study reports is essential.

At Merck, Wendy’s Proprietary Information and Knowledge Management group, in collaboration with the Safety Assessment and Laboratory Animal Resources (SALAR) group, developed an I2E workflow that extracted the key findings from safety assessment ante- and post-mortem reports, final reports, and protocols, in particular pulling out:

  • Study annotation (species, study duration, compound, target, dosage)
  • Interpreted results sections (i.e. summary or conclusions sections)
  • Organ-specific toxicology and histopathology findings
  • Haematological and serum biochemistry findings

In addition, a separate arm in the workflow leveraged the ABBYY OCR software to extract toxicokinetic (TK) parameters such as area under the curve (AUC), maximum drug concentration (Cmax), and time after dosing of peak drug plasma exposure (TMax) from PDF versions of the documents.

The extracted and normalized information was loaded into a semantic knowledgebase in the Cambridge Semantics ANZO tool and searched and visualized using a tailored ANZO dashboard. This faceted browsing environment enabled the SALAR researchers to ask questions such as, “what compounds with rat 3-month studies show kidney effects, and for these compounds, what long term studies do we have?”

Wendy presented several use cases showing real value of this system to the business, including the potential to influence regulatory guidelines. For example, the team were able to run an analysis to assess the correlation between 3-month sub-chronic non-rodent studies, and 9- or 12-month chronic non-rodent results; they found that in nearly 30% of cases an important new toxicologic finding was identified in the long-term studies, confirming the ongoing need for long-term studies.

Wendy stated, “This unstructured information represents a rich body of knowledge which, in aggregate, has potential to identify capability gaps and evaluate individual findings on active pipeline compounds in the context of broad historical data.”

With the current focus on refinement, replacement and reduction of animal studies, being able to identify when long-term studies are needed vs. when they are not essential for human risk assessment, will be hugely valuable; and extracting these nuggets of information from historic data will contribute to this understanding.

Expert interpretations and conclusions from thousands of past studies can potentially be converted into actionable knowledge. These findings exist as unstructured text in Safety documents. See Wendy Cornell speak on this, at our upcoming NLP and Big Data Symposium in San Francisco.

Learn more about drug development safety