We are always enthused to read about new ways to utilize text mining in the drug discovery and development process, and very much enjoyed the recent paper by Heinemann et al., “Reflection of successful anticancer drug development processes in the literature”. In this study, the researchers develop tools that allow the prediction of the approval or failure of a targeted cancer drug, using models based on information mined from MEDLINE abstracts, along with a slew of other quantitative metadata (e.g. MeSH headings, author counts, fraction of authors with industry affiliation, and more).
I2E, Linguamatics' text mining platform, enabled the researchers to systematically identify all MEDLINE abstracts containing both the protein target and the specific disease indication of a known set of successfully approved or failed cancer therapeutics; for example, abstracts containing both Her2 and breast cancer, or c-Kit and gastrointestinal stromal tumor (GIST). I2E enables the use of large vocabularies or ontologies of genes and diseases to extract key information, and the researchers used I2E for the rapid retrieval of publications containing any one of the many synonyms of a protein target or indication.
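To make the co-occurrence idea concrete, here is a minimal sketch in plain Python. It is not I2E and does not reflect how I2E works internally; the synonym lists, the function names and the three example abstracts are all invented for illustration, and a real run would draw on full gene and disease ontologies rather than a handful of hand-typed terms.

```python
import re

# Hypothetical, hand-typed synonym lists (assumption: a real system would
# use complete gene and disease vocabularies).
TARGET_SYNONYMS = {"HER2": ["HER2", "ERBB2", "HER-2/neu"]}
INDICATION_SYNONYMS = {
    "breast cancer": ["breast cancer", "breast carcinoma", "mammary carcinoma"]
}

def compile_terms(synonyms):
    """Build one case-insensitive pattern that matches any synonym as a whole term."""
    # Longest alternatives first, so longer synonyms win over their prefixes.
    alternatives = sorted((re.escape(s) for s in synonyms), key=len, reverse=True)
    return re.compile(r"(?<!\w)(?:" + "|".join(alternatives) + r")(?!\w)", re.IGNORECASE)

def cooccurring(abstracts, target, indication):
    """Return IDs of abstracts that mention both the target and the indication."""
    target_re = compile_terms(TARGET_SYNONYMS[target])
    indication_re = compile_terms(INDICATION_SYNONYMS[indication])
    return [
        doc_id
        for doc_id, text in abstracts.items()
        if target_re.search(text) and indication_re.search(text)
    ]

# Invented example abstracts, not real MEDLINE records.
abstracts = {
    "A1": "ERBB2 amplification predicts response in breast carcinoma.",
    "A2": "HER2 signalling in gastric tissue.",
    "A3": "Screening uptake for breast cancer in rural areas.",
}
hits = cooccurring(abstracts, "HER2", "breast cancer")
print(hits)
```

Only the first abstract mentions both a target synonym and an indication synonym, so it is the only one returned. Simple string matching like this misses ambiguous or context-dependent mentions, which is one reason curated ontologies matter for this kind of retrieval.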
The researchers found that, from 9 years before FDA approval, the set of approved target-indication pairs showed a significantly higher publication count than the eventually-failing pairs.
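A comparison like this depends on lining up each pair's publications against a reference year. The sketch below shows one way to count publications per year relative to approval; the function name, the window parameter and the publication years are assumptions made for illustration. How the paper aligned failed pairs in time is not covered here, so the reference year is simply passed in.

```python
from collections import Counter

def counts_by_years_before_approval(pub_years, approval_year, window=9):
    """Count publications per year offset from the reference year.

    Offset 0 is the reference (approval) year and offset -window is the
    earliest year kept. Publications outside the window are ignored.
    """
    counts = Counter()
    for year in pub_years:
        offset = year - approval_year
        if -window <= offset <= 0:
            counts[offset] += 1
    # One entry per offset, ordered from -window up to 0.
    return [counts[offset] for offset in range(-window, 1)]

# Invented publication years for one hypothetical target-indication pair.
pub_years = [1990, 1995, 1995, 1996, 1997, 1998, 1998]
series = counts_by_years_before_approval(pub_years, approval_year=1998, window=3)
print(series)
```

Series built this way for approved and failed pairs could then be compared year by year with a statistical test of the analyst's choosing.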
Taking the study further, they applied machine learning classifiers and found that the extracted data features could be used to predict success or failure of target-indication pairs, and hence, approved or failed drugs. They conclude:
“These patterns allow predicting success of drugs in Phase II or III with remarkably high accuracy”.
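For readers who want a feel for the classification step, here is a deliberately tiny sketch using a nearest-centroid classifier written in plain Python. This is a stand-in chosen for brevity, not the classifiers used in the paper, and the two features and all of the numbers are invented toy values rather than results from the study.

```python
from statistics import mean

def fit_centroids(rows, labels):
    """Average the feature vectors of each class to get one centroid per class."""
    centroids = {}
    for label in set(labels):
        members = [row for row, row_label in zip(rows, labels) if row_label == label]
        centroids[label] = [mean(column) for column in zip(*members)]
    return centroids

def predict(centroids, row):
    """Assign the class whose centroid is nearest in squared Euclidean distance."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))

# Invented toy features: [publications per year, fraction of industry authors].
train = [[40, 0.30], [55, 0.25], [8, 0.10], [12, 0.05]]
labels = ["approved", "approved", "failed", "failed"]

model = fit_centroids(train, labels)
print(predict(model, [50, 0.20]))
```

A real analysis would scale the features, use many more of them, and evaluate on held-out pairs; the point here is only the shape of the workflow, in which mined features go in and a predicted outcome comes out.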
The authors propose that these methods (combining text mining with machine learning) could easily be applied to other disease areas, and provide an indication of the direction of successful development. We look forward to seeing whether this prediction holds true, as any tool that can reduce drug attrition rates is highly valuable.
In many ways, these results are a validation of the increasing call for better basic understanding of the natural history of disease, whether in oncology or rare diseases. In 2014, AstraZeneca authors published a major review of their pipeline, which found that some of the decline in R&D productivity was because “the focus of scientists and clinicians moved away from the more demanding goal of thoroughly understanding disease pathophysiology and the therapeutic opportunities”. They also concluded that as the knowledge of the underlying disease biology increases, the likelihood of success does too.
So it is exciting that the approach taken in this paper, utilizing text mining and machine learning, could potentially enable drug developers to, as the authors state, “improve the prioritization of drug portfolios and contribute to an increase in the overall productivity of pharmaceutical R&D”.