Natural language processing from target to treatment: new learnings from Cambridge

April 16 2019

Spring is a lovely time to be in Cambridge – winter is finally moving on, the spring bulbs are out and the trees are in blossom. Time for the Linguamatics Spring Text Mining Conference, which again this year was blessed with lovely sunshine. And of course, the opportunity to hear the latest about Linguamatics products and some new and fascinating use cases from our customers.

In March 2019, attendees from across pharma and healthcare came to our Spring Text Mining Conference, for hands-on workshops, a Healthcare Hackathon, networking and great presentations. The presentations covered innovations in using Natural Language Processing (NLP) to get more value from a range of unstructured text, covering electronic medical records, regulatory documents and patient social media verbatims.

Patient Insights from Social Media at Roche

Mathias Leddin, Senior Data Scientist, pRED Informatics at Roche, gave a fascinating talk on the use of Linguamatics NLP to address patient-centered drug development (a relatively new FDA initiative). The focus was to discover whether patient blogs and forums (such as PatientsLikeMe) can provide a good substrate for developing clinical endpoints that are relevant to patients. Being able to understand what matters most to patients and find unexpected insights into patients’ problems could (and should) influence clinical trials, e.g. design and outcome measures. Using “highly trusted” social media sources (i.e. patient-focused communities rather than more diverse Twitter or Facebook posts) gave a more robust substrate to analyze. It is still a noisy process; from 24k verbatims downloaded, they gleaned valuable data from ~450 posts. After ensuring that privacy issues were addressed, they were able to categorize the comments into symptom or impact categories (e.g. Tingling: “electrical sensations all over my body”; “a tingling sensation, like ants”; “electrical nerve sensations”). Mathias described finding symptoms that confirmed the clinical trial endpoints, but also new ones (e.g. tingling, voice change, vertigo), and these specific recommendations have been taken forward.
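To give a flavour of what categorizing verbatims into symptom categories involves, here is a minimal Python sketch. It is not the Roche workflow and does not use Linguamatics software: it is a plain keyword matcher, and the `SYMPTOM_PATTERNS` table and `categorize` function are hypothetical names invented for illustration. Only the tingling examples come from the talk; the patterns for the other categories are assumptions.

```python
import re

# Hypothetical keyword patterns per symptom category (illustrative only).
# A real NLP pipeline would use ontologies and linguistic context instead.
SYMPTOM_PATTERNS = {
    "tingling": re.compile(
        r"\b(tingl\w*|electrical (nerve )?sensations?)\b", re.IGNORECASE
    ),
    "voice change": re.compile(
        r"\b(hoarse\w*|voice (change\w*|sounds? different))\b", re.IGNORECASE
    ),
    "vertigo": re.compile(
        r"\b(vertigo|room (is|was) spinning)\b", re.IGNORECASE
    ),
}


def categorize(verbatim):
    """Return the sorted list of symptom categories whose pattern matches."""
    return sorted(
        name for name, pattern in SYMPTOM_PATTERNS.items()
        if pattern.search(verbatim)
    )


if __name__ == "__main__":
    for post in [
        "electrical sensations all over my body",
        "a tingling sensation, like ants",
        "electrical nerve sensations",
    ]:
        print(post, "->", categorize(post))
```

A matcher this simple would miss paraphrases and pick up false positives, which is one reason so few of the downloaded posts yield usable data.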

J&J NLP analytics for regulatory compliance and IDMP master data management

IDMP provides a common language to connect currently siloed data across R&D and supply chain systems and is being used by many pharma companies to assist with master data management across the enterprise. Christopher Dunn & Costas Mistrellides (Johnson & Johnson Consumer) gave a presentation on the value of I2E to extract ~30 standard IDMP data elements from regulatory documents including Summary of Product Characteristics and regulatory dossiers (eCTD sections 3.2.S and 3.2.P). The challenges included a varied set of documents, some up to 50 years old, in mixed formats (Doc, Docx, image and text PDFs), across 5 different languages (English, French, Spanish, German, Italian). The output needed to be mapped to the J&J schema for their IDMP submission and internal business use. Over 1300 documents were processed, with an overall accuracy above 94%, saving J&J Consumer significant time and resources.

Integrating Cambridge University Hospital EMR data to understand Surgical Site Infections

We also had a presentation from a team from Cambridge University Hospital: “Identify Surgical Site Infections (SSI) using Mixed Assessment of Free Text and Structured Data in Electronic Health Records”. Lydia Drumright, Vince Taylor and James Lester gave a great tag-team talk on the issues around identifying SSIs. These infections can cost billions of dollars per year in the US, and being able to diagnose them early has a huge impact on costs and, of course, on patient care. However, finding the patients who are at risk is difficult because data collection isn’t comprehensive, especially as much of the information collected is in the unstructured text of medical records. Lydia, Vince and James described the work in progress to bridge the gap between structured and unstructured data, including the extract, transform and load processes and “data wrangling” needed to pull the records together and clean the data. They discussed the query modules needed (e.g. for explicit mention of ‘surgical site infection’/‘surgical wound infection’ etc.; relevant signs and symptoms such as localized swelling, redness, fever; or purulent wound discharge). Finally, they covered the post-processing to join all the data together (surgical events to intraoperative summaries, microbiology, demographics, wound descriptions, and more). This is an exciting work-in-progress and we look forward to hearing more from the team, once they have built their gold standard and successful model!
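As a rough illustration of the two ideas in this talk, query modules over free text and a join back to structured data, here is a minimal Python sketch. It is not the Cambridge team's implementation: the regular expressions, the `flag_note` and `join_records` functions, and the `patient_id` and `text` field names are all assumptions made for the example, and the matcher does no negation handling (so “no fever” would still be flagged).

```python
import re

# Query module 1: explicit mentions of the infection (terms from the talk).
EXPLICIT = re.compile(r"\bsurgical (site|wound) infection\b", re.IGNORECASE)

# Query module 2: relevant signs and symptoms (terms from the talk).
SIGNS = re.compile(
    r"\b(localized swelling|redness|fever|purulent (wound )?discharge)\b",
    re.IGNORECASE,
)


def flag_note(text):
    """Run both query modules over one free-text note."""
    return {
        "explicit_mention": bool(EXPLICIT.search(text)),
        "signs": sorted({m.group(0).lower() for m in SIGNS.finditer(text)}),
    }


def join_records(surgeries, notes):
    """Attach note flags to structured surgical events by patient_id.

    Both inputs are lists of dicts; the field names are hypothetical.
    """
    flags_by_patient = {}
    for note in notes:
        flags_by_patient.setdefault(note["patient_id"], []).append(
            flag_note(note["text"])
        )
    return [
        dict(event, note_flags=flags_by_patient.get(event["patient_id"], []))
        for event in surgeries
    ]


if __name__ == "__main__":
    surgeries = [{"patient_id": 1, "procedure": "hip replacement"}]
    notes = [{"patient_id": 1, "text": "Redness and fever around the wound."}]
    print(join_records(surgeries, notes))
```

In practice the join would also need to respect timing (notes written after the surgical event) and bring in microbiology and demographics, as the speakers described.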

CST: Mining full text papers with NLP, for basic cell biology and disease understanding

We were lucky to have one of our speakers travel from the US to present. Ela Skrzypek (Cell Signaling Technology) talked about the use of I2E at Cell Signaling Technology to create PhosphoSitePlus (PSP), a knowledge-based resource about protein modifications. PSP is a freely available database, continuously curated by Cell Signaling Technology scientists from scientific literature. PSP is used to elucidate the roles of post-translational modifications (PTMs) in normal and pathological cellular processes, and to accelerate the discovery of disease biomarkers and drug targets. Over 1500 articles are published per month, so article selection and curation can be a time-consuming and costly process. I2E has significantly reduced the manual load by enabling queries to be run for modification type, modification site, biological or disease processes, genes and proteins involved, and any diseases or chemical treatments that are related. The structured I2E output and highlighted cached documents enable rapid curation and review. Ela said:

Without I2E, we would not be able to fulfil our goal to be the most comprehensive resource for post-translational modifications and cell signaling for scientists around the world. We believe our work will continue to help scientists better understand the complexities of cellular systems and to develop new therapies for many devastating diseases like cancer or diabetes.

AI as a Partner in R&D to accelerate Chiesi Farmaceutici Drug Development

Last but by no means least of the customer talks came from Chiesi Farmaceutici. Carmela Pratelli presented on the journey the Chiesi team have travelled with Linguamatics over the past couple of years. Chiesi focus on three core therapeutic areas: respiratory, neonatology and rare diseases. The aim of the Scientific Knowledge & Information Resources Department is to accelerate turning information into strategic decisions. Carmela talked about the value of external partners who can provide the right technology and experience to assist with accelerating drug development, and the work they have done with Linguamatics within target validation & new product evaluation teams. One example Carmela presented was elucidating the literature landscape around idiopathic pulmonary fibrosis, a chronic, progressive and fatal interstitial lung disease with no known cause, in order to identify targets, tool compounds and biomarkers. Another looked for key opinion leaders in neonatal brain injury, with a workflow combining I2E, KNIME and Spotfire for integration, analysis and visualization of the data. Carmela also talked briefly about the next steps along the journey; they are now looking at using I2E for drug repurposing, patent analytics and opportunity scouting.

Linguamatics, IQVIA, and partner presentations

  • Marc Solfrian, IFI Claims, gave an update on the new content in Claims Direct, including exciting developments on machine translations using Google’s latest neural network technology, and improved Chinese coverage.
  • Nora Lapusnyik, ChemAxon, described the ChemAxon-Linguamatics partnership, now in its 10th year. Nora described some of the new functionality from ChemAxon, whose mission is to provide the best chemistry software solutions for the design-make-test-analyse cycle, and which is expanding both the content and the tools it delivers.
  • John Brimacombe gave a review of Linguamatics’ mission and progress over the past year.
  • Ben Hughes (SVP, Strategy and Technology, Real World & Analytics Solutions, IQVIA) provided an overview of IQVIA’s technology, and the “fit” of the Linguamatics NLP solution into the IQVIA real world data technology ecosystem.
  • Paul Milligan presented an update on the Linguamatics product portfolio, covering new changes in the iScite user experience, in OnDemand Content, and in our workflow tool AMP.
  • The Healthcare update came from Liz Marshall, with a fascinating talk on “Unstructured Health Data and the Road to Mars”.
  • The Life Science update, “NLP transforms documents for decision support across Pharma”, was presented by Jane Reed.
  • David Milward talked about some of the new and exciting features in I2E, including developments for multilingual NLP, support for Documentum and SQL databases, advances in query power, and provision of tools for gold standard evaluations. David also gave a first preview of I2E in a web browser; this was also featured in a hands-on workshop, and we have been delighted with the excellent user feedback.

As in recent years, the conference was held at the Møller Centre, Churchill College, Cambridge, and we all very much enjoyed the excellent food and facilities. We also enjoyed a delightful meal at Corpus Christi College. Thank you to everyone who contributed, and we hope to see you all at our other events across the year.


Get in touch if you want more information about the event, or if you want to access the presentations.

See the other events we are attending soon.