The recent two-day II-SDV meeting in the beautiful town of Nice on the Côte d’Azur, France, started with a day of talks considering how best to maximise the value of data extracted from a wide range of sources: patents, full-text articles and even big data.

The programme kicked off with a presentation from Aleksander Kapisoda from Boehringer Ingelheim (BI) describing how innovative use of custom search techniques, beyond what standard public search engines currently offer, can bring tangible benefits to a global pharmaceutical company.

Last week’s BioIT World Expo kicked off with a great keynote from Philip Bourne (Associate Director for Data Science, National Institutes of Health) setting the scene on a theme that ran throughout the conference: how can we benefit from big data analytics, or data science, for pharma R&D and delivery into healthcare? With two days of talks, 12 tracks covering cloud computing, NGS analytics, Pharmaceutical R&D Informatics, Data Visualization & Exploration Tools, and Data Security, plus a full day of workshops beforehand and a busy exhibition hall, there was plenty to see, do, take in and discuss. I attended several talks on best practice in data science by speakers from Merck, Roche, and BMS, and I was pleased to hear speakers mention text analytics, particularly natural language processing, as a key part of the overall data science solution.

Submitting a drug approval package to the FDA, whether for an NDA, BLA or ANDA, is a costly process.

The final amalgamation of different reports and documents into the overview document set can involve a huge amount of manual checking and cross-checking, from the subsidiary documents to the master.

It is crucial to get the review process right.

If there are any errors, the FDA can send back the whole package, delaying the application. But the manual checking involved in the review process is tedious, slow, and error-prone.

A delayed application can also be costly.

How much are we talking about?

While not every drug is a blockbuster, these numbers are indicative of what you could be losing: the top 20 drugs in the United States accounted for $319.9 billion in sales in 2011, so a newly launched blockbuster could make around $2Bn in its first year on the market. That’s roughly $6M per day.

If errors in the quality review hold up an NDA for even a week, this could generate significant costs.
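As a rough check on these figures, the per-day and per-week numbers can be worked through in a few lines of Python. This is a back-of-the-envelope sketch only: it assumes first-year sales of $2Bn are spread evenly over 365 days, and the function names are made up for illustration.

```python
# Back-of-the-envelope estimate; assumes sales are spread evenly over the year.
DAYS_PER_YEAR = 365

def daily_revenue(first_year_sales):
    """Average revenue per day for a given first-year sales figure."""
    return first_year_sales / DAYS_PER_YEAR

def delay_cost(first_year_sales, days_delayed):
    """Sales forgone if launch slips by days_delayed days (illustrative only)."""
    return daily_revenue(first_year_sales) * days_delayed

per_day = daily_revenue(2_000_000_000)
one_week = delay_cost(2_000_000_000, 7)
print(f"per day:  ${per_day / 1e6:.1f}M")
print(f"one week: ${one_week / 1e6:.1f}M")
```

Under that even-spread assumption, the daily figure comes out at about $5.5M (rounded up to $6M in the text), and a one-week delay at just under $40M.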

So – how can text analytics improve this quality assurance process?

Linguamatics has worked with some of our top 20 pharma customers to develop an automated process to improve quality control of regulatory document submission.

The process cross-checks MedDRA coding, references to tables, decimal place errors, and discrepancies between the summary document and source documents. This requires the use of advanced processing to extract information from tables in PDF documents as well as natural language processing to analyze the free text.
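To give a feel for two of these checks, here is a minimal Python sketch. It is not the Linguamatics process or the I2E API: the function names are hypothetical, and it assumes the values and table numbers have already been extracted from the documents, which in practice is the hard part. It flags references to tables that do not exist, and classifies a summary value against its source value as a match, a likely decimal place error, or some other discrepancy.

```python
import math
import re

# Hypothetical helpers for illustration only.
TABLE_REF = re.compile(r"\bTable\s+(\d+)\b")

def missing_table_refs(summary_text, known_tables):
    """Return table numbers cited in the text that are not in known_tables."""
    cited = {int(n) for n in TABLE_REF.findall(summary_text)}
    return sorted(cited - set(known_tables))

def classify_discrepancy(summary_value, source_value):
    """Compare a value in the summary with the same value in the source."""
    if math.isclose(summary_value, source_value, rel_tol=1e-9):
        return "match"
    if summary_value != 0 and source_value != 0:
        ratio = summary_value / source_value
        if ratio > 0:
            # A misplaced decimal point shows up as a whole power of ten.
            exponent = math.log10(ratio)
            shift = round(exponent)
            if shift != 0 and abs(exponent - shift) < 1e-6:
                return "decimal_shift"
    return "mismatch"

print(missing_table_refs("Results are shown in Table 3 and Table 7.", [1, 2, 3]))
print(classify_discrepancy(12.5, 1.25))
```

A real pipeline would sit on top of table extraction from PDF and NLP over the free text; the sketch only shows the comparison step once that extraction has been done.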

Patent literature is a hugely valuable source of novel information for life science research and business intelligence.

The wealth of knowledge disclosed in patents may not be found in other information sources, such as MEDLINE or full text journal articles.

Patent landscape reports (also known as patent mapping or IP landscaping) provide a snap-shot of the patent situation of a specific technology, and can be used to understand freedom to operate issues, to identify in- and out-licensing opportunities, to examine competitor strengths and weaknesses, or as part of a more comprehensive market analysis.

These are valuable searches, but they demand advanced search and data visualization techniques, as any particular landscape report requires examination of many hundreds or thousands of patent documents.

Innovative use of I2E resulted in a 99% efficiency gain in delivering relevant information

Patent text is unstructured; the information needed is often embedded within the body of the patent and may be scattered throughout the lengthy descriptions; and the language is often complex and designed to obfuscate.

Bristol Myers Squibb Text analytics workflow uncovers kinase assay trends

A recent paper by a team at Bristol Myers Squibb describes a novel workflow to discover trends in kinase assay technology.

Big data? Real world data? What do we really mean?

I was at a conference a couple of weeks ago, an interesting two days spent discussing what big data is in the life science domain, and what value we can expect to gain from better access and use.

The keynote speaker kicked off the first day with a great quote from Atul Butte: “Hiding within those mounds of data is knowledge that could change the life of a patient or change the world”.

This is a really great ambition for data analytics. But one interesting topic was: what do we mean by big data? One common definition from some of the Pharma folk was that it is any sort of patient-related data that originated outside their organization.

To me, this definition seems to refer more to real world data: adverse event reports, electronic health records, voice of the customer (VoC) feeds, social media data, claims data, patient group blogs. In other words, any data that hasn’t been influenced by the drug provider and can give an external view, whether from the patient, payer, or healthcare provider.

Many of these real world sources have free text fields, and this is where text analytics, and natural language processing (NLP), can fit in. We have customers who are using text analytics to get actionable insight from real world data – and finding valuable intelligence that can inform commercial business strategies.

Valuable information could be found in electronic health records, but these are notoriously hard to access for Pharma, with regulations and restrictions around data use, data privacy etc.

So, what real world data are accessible?