Patent literature is a hugely valuable source of novel information for life science research and business intelligence.

The wealth of knowledge disclosed in patents may not be found in other information sources, such as MEDLINE or full text journal articles.

Patent landscape reports (also known as patent mapping or IP landscaping) provide a snapshot of the patent situation for a specific technology. They can be used to understand freedom-to-operate issues, to identify in- and out-licensing opportunities, to examine competitor strengths and weaknesses, or as part of a more comprehensive market analysis.

These are valuable searches, but they demand advanced search and data visualization techniques, as any particular landscape report requires examination of many hundreds or thousands of patent documents.

Innovative use of I2E resulted in a 99% efficiency gain in delivering relevant information

Patent text is unstructured; the information needed is often embedded within the body of the patent and may be scattered throughout the lengthy descriptions; and the language is often complex and designed to obfuscate.

Bristol Myers Squibb Text analytics workflow uncovers kinase assay trends

A recent paper by a team at Bristol Myers Squibb describes a novel workflow to discover trends in kinase assay technology.

The aim was to strengthen their internal kinase screening technology, with the first step being to analyze industry trends and benchmark BMS’ capabilities against other pharmaceutical companies, with key questions including:


Big data? Real world data? What do we really mean?

I was at a conference a couple of weeks ago: an interesting two days spent discussing what big data means in the life science domain, and what value we can expect to gain from better access and use.

The keynote speaker kicked off the first day with a great quote from Atul Butte: “Hiding within those mounds of data is knowledge that could change the life of a patient or change the world”.

This is a really great ambition for data analytics. But one interesting topic was: what do we mean by big data? One common definition from some of the Pharma folk was that it is any sort of data originating outside their organization that relates to patient information.

To me, this definition seems to refer more to real world data – adverse event reports, electronic health records, voice of the customer (VoC) feeds, social media data, claims data, patient group blogs. Again, any data that hasn’t been influenced by the drug provider, and can give an external view – either from the patient, payer, or healthcare provider.

Many of these real world sources have free text fields, and this is where text analytics, and natural language processing (NLP), can fit in. We have customers who are using text analytics to get actionable insight from real world data – and finding valuable intelligence that can inform commercial business strategies.

Valuable information could be found in electronic health records, but these are notoriously hard to access for Pharma, with regulations and restrictions around data use, data privacy etc.

So, what real world data are accessible?


Patent information professionals gathered in sunny San Francisco for the 2015 PIUG Biotechnology Conference on February 16–18.

The conference, hosted at Genentech, offered a mix of workshops, presentations, vendor exhibitions and networking opportunities that brought together patent searchers from diverse biotechnology organizations.

The central theme of this year’s conference was “Maximizing Value in Biotechnology Searching with New Technologies and Trends”. Delegates were eager to enhance existing search strategies, which included a mix of content provider search tools, keyword search, in-house developed programming/machine learning, manual curation outsourcing and, for many, Linguamatics I2E.

They all had one thing in common: everyone was interested in finding new trends, techniques and technologies that would help them return more relevant patent information more efficiently.

The conference started on the first day with a series of workshops. David Milward, our CTO, delivered a workshop on new developments in text mining patents.

The workshop included an overview of updates to our text mining platform, I2E, that allow easier embedding and automation, multilingual processing, improved visualization and simpler extraction of information from tables – all of which resonated well with this year’s theme.


I read with interest a recent publication which sheds light on the complex interactions of synapse protein complexes with human disease.

The study (run by the Genes to Cognition neuroscience research programme) combined wet-lab research with bioinformatics and text analytics to uncover genetic associations with these protein complexes in over seventy human brain diseases, including Alzheimer’s Disease, Schizophrenia and Autism spectrum disorders.

The idea was to identify and develop suitable screening assays for synapse proteomes from post-mortem and neurosurgical brain samples, focusing specifically on Membrane-associated guanylate kinase (MAGUK) associated signalling complexes (MASC).

Our CTO, David Milward, was involved in the text analytics work. He used the natural language processing capabilities of the Linguamatics I2E platform to extract gene-mutation-disease associations from PubMed abstracts. The flexibility of I2E enabled an appropriate balance of recall and precision, thus providing comprehensive results while not overloading curators with noise. Queries were built using linguistic patterns to allow associations to be discovered between a list of several thousand relevant gene identifiers and appropriate MedDRA disease terms.

The key aim was to provide comprehensive results with suitable accuracy to allow fast curation. These text-mined results were combined with data from Online Mendelian Inheritance in Man (OMIM) on human MASC genes and genetic disease associations.
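As a rough illustration of the kind of pattern-based association extraction described above, the Python sketch below pairs gene symbols with disease terms that occur in the same sentence as a linking word. It is not I2E's query language, nor the method used in the study. The tiny vocabularies, the `LINK_WORDS` pattern and the `extract_associations` function are all hypothetical stand-ins for the gene identifier list and MedDRA terms mentioned above.

```python
import re
from typing import List, Tuple

# Hypothetical toy vocabularies (assumptions for illustration only);
# the real work used thousands of gene identifiers and MedDRA terms.
GENES = {"DLG4", "SHANK3", "GRIN2B"}
DISEASES = {"schizophrenia", "autism", "alzheimer's disease"}

# Hypothetical linking words suggesting that an association is asserted.
LINK_WORDS = re.compile(
    r"\b(associated|linked|implicated|causes?|mutations?)\b", re.IGNORECASE
)


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_associations(text: str) -> List[Tuple[str, str, str]]:
    """Return (gene, disease, sentence) for each co-occurrence in a sentence
    that also contains a linking word."""
    results = []
    for sentence in split_sentences(text):
        # Requiring a linking word trades some recall for precision.
        if not LINK_WORDS.search(sentence):
            continue
        lowered = sentence.lower()
        # Gene symbols are matched case-sensitively, diseases case-insensitively.
        genes = sorted(
            g for g in GENES if re.search(rf"\b{re.escape(g)}\b", sentence)
        )
        diseases = sorted(
            d for d in DISEASES if re.search(rf"\b{re.escape(d)}\b", lowered)
        )
        for gene in genes:
            for disease in diseases:
                results.append((gene, disease, sentence))
    return results
```

Keeping the matching sentence alongside each pair is a deliberate choice in this sketch: it gives a curator the evidence needed to accept or reject a candidate association quickly. Dropping the linking-word filter would raise recall at the cost of more noise.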


What challenges were seen in competitive R&D and clinical stages? What outcomes were measured in related trials? Does the drug I am creating have potential efficacy or safety challenges? What does the patient population look like?

These are the sort of critical business questions that many life science researchers need to answer. And now, there’s a solution that can help you.

We all know the importance of high-quality content you can depend on when it comes to making key business decisions across the pharma life cycle. We also know that the best way to get from textual data to new insights is using natural language processing-based text analytics. And that’s where our partnership with Thomson Reuters comes in. We’ve worked together on a solution to bring Linguamatics’ market-leading text mining platform, I2E, together with Thomson Reuters Cortellis high-quality clinical and epidemiology content: Cortellis Informatics Clinical Text Analytics for I2E.

Cortellis Informatics Clinical Text Analytics for I2E applies the power of natural language processing-based text mining from Linguamatics I2E to Cortellis clinical and epidemiology content sets. Taking this approach allows users to rapidly extract relevant information using the advanced search capabilities of I2E. The solution also allows users to identify concepts using a rich set of combined vocabularies from Thomson Reuters and Linguamatics.