Pfizer improves Patent Search 10-fold with Linguamatics I2E

Intellectual property is critical in the drug discovery process. Before initiating a new project, it is important to understand the patent landscape around the disease area, check whether there is freedom to operate, and assess patentability. The business case for a project's commercial viability must cover not just the biology (“is there unmet medical need?”) but also the IP position (“what is the IP position?”).

Streamlining patent research with natural language processing (NLP) text mining

Scientists and researchers therefore need access to the information on genes and diseases in patents. But patents can be hundreds of pages long and contain complex constructions and interconnected facts. Manual patent research is time-consuming and costly. More and more pharma companies, such as Pfizer, are turning to NLP text mining to keep up to date with the patent literature.

Pfizer researchers use Linguamatics Life Science Platform powered by I2E to find patents relating to specific diseases. The results feed a database to visualize gene targets, invention type, competitor organizations and overall patent “relevancy”. 


Pentavere Research Group of Toronto, Canada, was developing a platform to provide health insights from Real-World Evidence (RWE). Pentavere’s aim is to improve healthcare efficiency by allowing life science companies and healthcare providers to understand the impact of clinical decisions made in the primary care setting.

The company’s proprietary platform, daRWEn™, uses digitized, de-identified, and aggregated health information, but much of the valuable data that it wanted to include was locked inside free-form text, making it difficult to extract. Pentavere soon realized that it needed to incorporate natural language processing (NLP) capabilities into its platform in order to access these RWE insights. To achieve this in a timely and efficient manner, it chose to integrate the Linguamatics I2E NLP solution into daRWEn™.

Why Linguamatics? There were several important factors, including:


A project with Drexel University highlighted the importance of applying natural language processing (NLP) to cohort selection in support of medical research, clinical trials recruitment and outcomes analysis.

Drexel was setting up a study into patients with HIV and Hepatitis C and needed to identify potential subjects from their AllScripts EHR. As many organizations do, they had five medical students spend four months (working part-time) trawling through patient records to identify 700 potential study candidates.

The process was particularly painful because simply looking for the ICD codes for HIV and Hepatitis C in structured fields missed significant numbers of potential subjects. This was caused by variations in where the data was recorded: sometimes it was coded in structured fields, sometimes the patient narrative stated that he or she was positive for HIV or Hepatitis C, and sometimes it was both.

Assessing the narrative is always a problem, because it requires careful reading: patient history has to be distinguished from family history, and “tested for HIV, negative result” from “positive for HIV”.
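The problem described above can be illustrated with a minimal Python sketch. It is not how I2E works; it is a hypothetical toy filter, assuming a record is a list of ICD-10 codes plus a free-text note. The code set, the cue-word lists, and the function names are all assumptions made for illustration.

```python
import re

# Illustrative ICD-10 codes (B20: HIV disease, B18.2: chronic viral hepatitis C).
# A real study would use a fuller, clinically reviewed code list.
TARGET_CODES = {"B20", "B18.2"}

# Hypothetical cue lists; real narratives need far richer patterns.
CONDITION = re.compile(r"\b(HIV|hepatitis C|HCV)\b", re.IGNORECASE)
NEGATION = re.compile(r"\b(negative|denies|no evidence of|ruled out)\b", re.IGNORECASE)
FAMILY = re.compile(r"\b(family history|mother|father|brother|sister)\b", re.IGNORECASE)


def narrative_positive(note: str) -> bool:
    """Return True if any sentence mentions the condition without a
    negation cue or a family-history cue in the same sentence."""
    for sentence in re.split(r"(?<=[.!?])\s+", note):
        if not CONDITION.search(sentence):
            continue
        if NEGATION.search(sentence) or FAMILY.search(sentence):
            continue
        return True
    return False


def is_candidate(codes, note: str) -> bool:
    """A record qualifies if it is coded in structured fields OR the
    narrative contains an affirmed mention of the condition."""
    return bool(TARGET_CODES & set(codes)) or narrative_positive(note)
```

With this sketch, a record with no structured code but the note “Patient is positive for HIV.” is kept, while “Tested for HIV, negative result.” and “Family history of hepatitis C.” are not. Sentence-level cue matching is deliberately crude and will misread many real notes, which is the gap a full NLP platform is meant to close.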

Drexel used our I2E NLP platform, having indexed a large collection of patient records extracted from AllScripts via their analytical data warehouse.

The data sets were indexed with the usual domain ontologies covering diseases, medications, procedures etc. to support rapid searching in I2E.


Text Mining Platform I2E features in Best Practices Final and as a Best of Show Award Contender; Linguamatics CTO David Milward a Featured Speaker

Cambridge, UK & Boston, USA – May 22, 2017 – Leading Natural Language Processing (NLP) text analytics provider Linguamatics today announced plans to highlight the latest version of its text mining platform at this week’s Bio-IT World Conference & Expo in Boston. Bio-IT World has named Linguamatics I2E 5.0 a contender for the Best of Show Award, and Linguamatics’ customer Pentavere Research Group a Best Practices finalist.

The Best of Show Awards showcase exceptional innovation in technologies used by life science professionals. As a Best of Show Award contender, Linguamatics is also eligible for the Bio-IT World People’s Choice Award, chosen by votes from the Bio-IT World Community. Voting for the People’s Choice Award is open from 5 pm ET Tuesday May 23 through 1 pm ET on Wednesday May 24.

Bio-IT World also chose Linguamatics' customer Pentavere Research Group as a Best Practices finalist, based on their work using I2E to mine unstructured data for real-world evidence to improve health outcomes. Best Practices finalists are recognized for their outstanding examples of technology innovation, from basic R&D to translational medicine. Pentavere deployed I2E to effectively mine unstructured EHR data, expediting delivery of their product daRWEn™ to the Real World Evidence market.


There’s a lot of buzz in the healthcare community at the moment surrounding the use of artificial intelligence with machine learning for pattern identification, decision-making, and outcome prediction. The availability of high-quality data for training algorithms is vital to machine learning’s success, but a lot of this information is tied up in unstructured clinical notes. Natural language processing (NLP) is the key to extracting the “good stuff” from this vast trove of unstructured text. Combining that “good stuff” with already structured data helps healthcare providers to understand the patterns and trends in data via machine learning, and thereby enhance care, reduce costs, and improve population health.

Which type of NLP software is best?

The first question that healthcare users must ask themselves is “Which type of NLP software best suits my needs?”

Statistical NLP systems require example data to identify patterns in new data. The examples may come from dictionaries or ontologies, or they might need to be manually annotated by a clinician, which can be an extremely laborious and institutionally costly task.
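As a rough illustration of what “learning from examples” means, here is a minimal sketch of a word-count (naive Bayes style) classifier in plain Python. The four annotated examples, the labels, and the function names are hypothetical; a real statistical system would need far more annotated data and a proper library.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (text, label) pairs, e.g. annotated by a clinician.
    Returns per-label word counts and per-label example counts."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts


def predict(text, word_counts, label_counts):
    """Pick the label with the highest log-probability, using add-one
    smoothing so unseen words do not zero out a label."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n in label_counts.items():
        score = math.log(n / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.lower().split():
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical annotated examples, invented for illustration only.
EXAMPLES = [
    ("patient is positive for hiv", "positive"),
    ("hiv infection confirmed", "positive"),
    ("tested for hiv negative result", "negative"),
    ("hiv test negative", "negative"),
]
```

After `model = train(EXAMPLES)`, calling `predict("hiv negative", *model)` picks the label whose training examples share the most words with the new text. The quality of the result depends entirely on how many examples were annotated, which is where the cost described above comes from.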

Meanwhile, most rule-based NLP systems require a specialist to define the language rules or patterns that represent particular healthcare concepts. This approach can make them more accurate, but they are limited to the patterns that the specialist has thought of.
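The limitation of hand-written rules can be seen in a small Python sketch. The rule set, concept name, and function below are hypothetical and stand in for whatever a specialist might define; they do not represent any particular product.

```python
import re

# Hypothetical rule set: only the phrasings the specialist thought to include.
RULES = {
    "myocardial_infarction": [
        r"\bmyocardial infarction\b",
        r"\bheart attack\b",
    ],
}


def tag_concepts(text, rules=RULES):
    """Return the set of concepts whose patterns match the text."""
    found = set()
    for concept, patterns in rules.items():
        if any(re.search(p, text, re.IGNORECASE) for p in patterns):
            found.add(concept)
    return found
```

Here “History of heart attack in 2015” is tagged with the concept, but “Prior STEMI in 2015” is not, because nobody wrote a rule for that abbreviation. The matches that do fire are precise; the misses are silent.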