Submitting a drug approval package to the FDA, whether for an NDA, BLA or ANDA, is a costly process.

The final amalgamation of different reports and documents into the overview document set can involve a huge amount of manual checking and cross-checking, from the subsidiary documents to the master.

It is crucial to get the review process right.

If there are any errors, the FDA can send back the whole package, delaying the application. But the manual checking involved in the review process is tedious, slow, and error-prone.

A delayed application can also be costly.

How much are we talking about?

While not every drug is a blockbuster, these numbers are indicative of what you could be losing: the top 20 drugs in the United States accounted for $319.9 billion in sales in 2011, so a newly launched blockbuster could make around $2Bn in its first year on the market – that’s roughly $5.5M per day.

If errors in the quality review hold up an NDA for even a week, that could mean tens of millions of dollars in lost sales.
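
To make the arithmetic concrete, here is a minimal sketch of the cost-of-delay estimate. It assumes, purely for illustration, that first-year sales are spread evenly across 365 days (real launch curves ramp up over time), and the function names are hypothetical. Under that assumption, $2Bn a year works out to about $5.5M per day, in the same ballpark as the per-day figure quoted above, and a one-week hold-up to roughly $38M.

```python
def daily_revenue(annual_sales_usd: float, days_per_year: int = 365) -> float:
    """Average revenue per day, assuming sales are spread evenly over the year."""
    return annual_sales_usd / days_per_year


def delay_cost(annual_sales_usd: float, delay_days: int) -> float:
    """Revenue forgone during a delay of `delay_days` days (same even-spread assumption)."""
    return daily_revenue(annual_sales_usd) * delay_days


first_year_sales = 2_000_000_000  # the ~$2Bn blockbuster figure from the text

per_day = daily_revenue(first_year_sales)
one_week = delay_cost(first_year_sales, 7)

print(f"Per day:  ${per_day:,.0f}")
print(f"One week: ${one_week:,.0f}")
```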

So – how can text analytics improve this quality assurance process?

Linguamatics has worked with some of our top 20 pharma customers to develop an automated process to improve quality control of regulatory document submission.

The process cross-checks MedDRA coding, references to tables, decimal place errors, and discrepancies between the summary document and source documents. This requires the use of advanced processing to extract information from tables in PDF documents as well as natural language processing to analyze the free text.
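
As an illustration of one of these checks, the sketch below compares values quoted in a summary document against the same values in a source document and flags likely decimal place errors. It is a simplified stand-in, not the Linguamatics process: it assumes the values have already been extracted from the text and tables into label-to-value dictionaries, and the labels, function names and toy values are all hypothetical.

```python
from decimal import Decimal


def classify(summary_value: str, source_value: str) -> str:
    """Compare one summary value against its source value."""
    s, t = Decimal(summary_value), Decimal(source_value)
    if s == t:
        return "match"
    s_parts, t_parts = s.normalize().as_tuple(), t.normalize().as_tuple()
    # Same significant digits and sign but a different magnitude:
    # the decimal point has probably slipped (e.g. 1.25 vs 12.5).
    if s_parts.digits == t_parts.digits and s_parts.sign == t_parts.sign:
        return "decimal_place_error"
    return "mismatch"


def cross_check(summary: dict, source: dict) -> list:
    """Return (label, summary_value, source_value, status) for every non-matching label."""
    findings = []
    for label in sorted(summary):
        source_value = source.get(label)
        if source_value is None:
            findings.append((label, summary[label], None, "missing_in_source"))
            continue
        status = classify(summary[label], source_value)
        if status != "match":
            findings.append((label, summary[label], source_value, status))
    return findings


# Toy values for illustration only; not taken from any real submission.
summary_doc = {"mean_age": "54.2", "alt_increase_pct": "1.25", "n_subjects": "310"}
source_doc = {"mean_age": "54.2", "alt_increase_pct": "12.5", "n_subjects": "301"}

findings = cross_check(summary_doc, source_doc)
for finding in findings:
    print(finding)
```

Values are kept as strings and parsed with `Decimal` so that "12.5" and "12.50" compare as equal without floating-point rounding getting in the way.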

Patent literature is a hugely valuable source of novel information for life science research and business intelligence.

The wealth of knowledge disclosed in patents may not be found in other information sources, such as MEDLINE or full text journal articles.

Patent landscape reports (also known as patent mapping or IP landscaping) provide a snap-shot of the patent situation of a specific technology, and can be used to understand freedom to operate issues, to identify in- and out-licensing opportunities, to examine competitor strengths and weaknesses, or as part of a more comprehensive market analysis.

These are valuable searches, but they demand advanced search and data visualization techniques, as any particular landscape report requires examination of many hundreds or thousands of patent documents.

Innovative use of I2E resulted in a 99% efficiency gain in delivering relevant information

Patent text is unstructured; the information needed is often embedded within the body of the patent and may be scattered throughout the lengthy descriptions; and the language is often complex and designed to obfuscate.

Bristol Myers Squibb: text analytics workflow uncovers kinase assay trends

A recent paper by a team at Bristol Myers Squibb describes a novel workflow to discover trends in kinase assay technology.

Big data? Real world data? What do we really mean?

I was at a conference a couple of weeks ago: an interesting two days spent discussing what big data means in the life science domain, and what value we can expect to gain from better access and use.

The keynote speaker kicked off the first day with a great quote from Atul Butte: “Hiding within those mounds of data is knowledge that could change the life of a patient or change the world.”

This is a really great ambition for data analytics. But one interesting topic was: what do we mean by big data? One common definition from some of the Pharma folk was that it is any sort of data relating to patient information that originated outside their organization.

To me, this definition seems to refer more to real world data – adverse event reports, electronic health records, voice of the customer (VoC) feeds, social media data, claims data, patient group blogs. Again, any data that hasn’t been influenced by the drug provider, and can give an external view – either from the patient, payer, or healthcare provider.

Many of these real world sources have free text fields, and this is where text analytics, and natural language processing (NLP), can fit in. We have customers who are using text analytics to get actionable insight from real world data – and finding valuable intelligence that can inform commercial business strategies.

Valuable information could be found in electronic health records, but these are notoriously hard to access for Pharma, with regulations and restrictions around data use, data privacy etc.

So, what real world data are accessible?

Patent information professionals gathered in sunny San Francisco for the 2015 PIUG Biotechnology Conference on February 16–18.

The conference, hosted at Genentech, offered a mix of workshops, presentations, vendor exhibitions and networking opportunities that brought together patent searchers from diverse biotechnology organizations.

The central theme of this year’s conference was “Maximizing Value in Biotechnology Searching with New Technologies and Trends”. Delegates were eager to enhance existing search strategies which included a mix of content provider search tools, keyword search, in-house developed programming/machine learning, manual curation outsourcing and, for many, Linguamatics I2E.

They all had one thing in common: everyone was interested in finding new trends, techniques and technologies that would help them return more relevant patent information more efficiently.

The conference started on the first day with a series of workshops. David Milward, our CTO, delivered a workshop on new developments in text mining patents.

The workshop included an overview of updates to our text mining platform I2E, to allow easier embedding and automation, multilingual processing, improved visualization and simpler extraction of information from tables – all of which resonated well with this year’s theme.

I read with interest a recent publication which sheds light on the complex interactions of synapse protein complexes with human disease.

The study (run by the Genes to Cognition neuroscience research programme) combined wet-lab research with bioinformatics and text analytics to uncover genetic associations with these protein complexes in over seventy human brain diseases, including Alzheimer’s disease, schizophrenia and autism spectrum disorders.

The idea was to identify and develop suitable screening assays for synapse proteomes from post-mortem and neurosurgical brain samples, focusing specifically on membrane-associated guanylate kinase (MAGUK)-associated signalling complexes (MASC).

Our CTO, David Milward, was involved in the text analytics work. He used the natural language processing capabilities of the Linguamatics I2E platform to extract gene-mutation-disease associations from PubMed abstracts. The flexibility of I2E enabled an appropriate balance of recall and precision, thus providing comprehensive results while not overloading curators with noise. Queries were built using linguistic patterns to allow associations to be discovered between a list of several thousand relevant gene identifiers and appropriate MedDRA disease terms.
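
For readers less familiar with the recall and precision trade-off, here is a minimal sketch of how the two measures are computed for a set of extracted associations against a curated reference set. The gene and disease labels are placeholders and the function name is hypothetical; nothing here comes from the study itself or from I2E.

```python
from typing import Set, Tuple

Association = Tuple[str, str]  # (gene identifier, disease term)


def precision_recall(extracted: Set[Association], curated: Set[Association]) -> Tuple[float, float]:
    """Precision: share of extracted associations that are correct.
    Recall: share of curated (known-correct) associations that were found."""
    true_positives = len(extracted & curated)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(curated) if curated else 0.0
    return precision, recall


# Placeholder data for illustration only.
extracted = {("GENE_A", "disease_x"), ("GENE_B", "disease_y"),
             ("GENE_C", "disease_z"), ("GENE_D", "disease_x")}
curated = {("GENE_A", "disease_x"), ("GENE_B", "disease_y"), ("GENE_E", "disease_z")}

precision, recall = precision_recall(extracted, curated)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A broader query typically raises recall (fewer missed associations) at the cost of precision (more noise for curators to discard), which is the balance described above.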

The key aim was to provide comprehensive results with suitable accuracy to allow fast curation. These text-mined results were combined with data from Online Mendelian Inheritance in Man (OMIM) on human MASC genes and genetic disease associations.