Shifting payment models based on quality and value are fueling demand for insights into the health of populations, and meeting that demand requires the analysis of vast amounts of patient data. For example, before healthcare organizations can implement pre-emptive care programs, they must first assess the relative risk of their patient population. This assessment draws on a variety of clinical, financial, and lifestyle factors, including:

  • Patient problem lists, especially chronic conditions
  • Procedures, medications, and other hospital data
  • Claims information
  • Risk factors such as tobacco, alcohol, and drug use
  • Availability and accessibility of health services and social support
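
As a minimal sketch of how these factors might feed a risk-stratification step, the Python below scores patients and segments them into tiers. All field names, weights, and thresholds here are hypothetical illustrations, not taken from any specific healthcare analytics product.

```python
# Illustrative population risk stratification. Field names, weights,
# and tier thresholds are hypothetical assumptions for this sketch.

def risk_score(patient):
    """Combine clinical, financial, and lifestyle factors into one score."""
    score = 0.0
    score += 3 * len(patient.get("chronic_conditions", []))   # problem list
    score += 1 * len(patient.get("procedures", []))           # hospital data
    score += patient.get("annual_claims_usd", 0) / 10_000     # claims
    for factor in ("tobacco_use", "alcohol_use", "drug_use"): # risk factors
        if patient.get(factor):
            score += 2
    if not patient.get("has_social_support", True):           # social support
        score += 1
    return score

def stratify(patients, high=8, medium=4):
    """Segment a population into high/medium/low risk tiers by score."""
    tiers = {"high": [], "medium": [], "low": []}
    for p in patients:
        s = risk_score(p)
        tier = "high" if s >= high else "medium" if s >= medium else "low"
        tiers[tier].append(p["id"])
    return tiers

population = [
    {"id": "A", "chronic_conditions": ["diabetes", "copd"],
     "tobacco_use": True, "annual_claims_usd": 25_000},
    {"id": "B", "procedures": ["appendectomy"], "annual_claims_usd": 4_000},
    {"id": "C"},
]
print(stratify(population))  # {'high': ['A'], 'medium': [], 'low': ['B', 'C']}
```

Even a toy scoring rule like this reproduces the pattern in Figure 1: a small high-risk tier separates out from the rest of the population once chronic conditions and claims costs are weighted together.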

As illustrated in Figure 1, a healthcare population typically includes only a small percentage of highest-risk patients, yet these least-healthy patients usually account for the largest share of overall healthcare costs.


Figure 1: Level of patient risk associated with population segments and
their cost implications; a relatively small segment of the population
accounts for a disproportionate percentage of healthcare costs

With new and exciting technologies, one particular application or use case often leads the way initially, and then, as the euphoria turns into commercial reality, people start looking at other applications where the new technology can also bring value. The same holds true in text mining. Pharma companies have been using NLP text mining technologies for many years in areas such as target validation, gene-disease associations, clinical trial optimization, and patent analytics. As they have become comfortable, and indeed expert, in these areas, attention has turned to applications where the core technology needs to be adapted or tweaked to meet a specific requirement.

For example, when looking to apply NLP to the time-consuming and costly business of discovering novel compounds, users hit a significant issue: understanding every single component part of some very long chemical names. Not an insurmountable problem, but one that needed time, expertise, and determination.

I attended a Big Data in Pharma conference recently, and very much liked a quote from Sir Muir Gray, cited by one of the speakers: "In the nineteenth century health was transformed by clear, clean water. In the twenty-first century, health will be transformed by clean clear knowledge."  

This was part of a series of discussions and round tables on how we, within the Pharma industry, can best use big data, both current and legacy, to inform decisions for the discovery, development, and delivery of new healthcare therapeutics. Data integration, breaking down data silos to create data assets, data interoperability, and the use of ontologies and NLP were all themes presented, with the aim of enabling researchers and scientists to have a clean, clear view of all the appropriate knowledge for actionable decisions across the drug development pipeline.

A new publication describes how text analytics can provide one of the tools for that data interoperability ecosystem, to create a clear, clean view. McEntire et al. describe a system that combines Pipeline Pilot workflow tools, Linguamatics I2E NLP linguistics and semantics, and visualization dashboards to integrate information from key public domain sources, such as MEDLINE, OMIM, NIH grants, patents, and news feeds, as well as internal content sources.

What if physicians could offer patients access to a potentially life-preserving test, but could not easily identify which of their patients were eligible?

That is the exact situation many providers have found themselves in since Medicare announced it would begin covering lung cancer screening for patients meeting a certain set of criteria.

In a decision memo published in February 2015, CMS agreed to make Medicare coverage available for low-dose computed tomography (LDCT) lung cancer screening for eligible patients. Patients who are between 55 and 77 years old, are asymptomatic, are either current smokers or quit within the last 15 years, and have a tobacco smoking history of at least 30 pack-years can now qualify for an annual preventive screening.
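
The criteria above are concrete enough to express as a simple rule check. The sketch below is illustrative only; function and parameter names are my own, not a CMS specification, and a real eligibility system would of course work from coded EHR data rather than hand-entered values.

```python
# Sketch of the 2015 CMS LDCT screening eligibility criteria described
# above. Names and signatures are illustrative assumptions.

def pack_years(packs_per_day, years_smoked):
    """Pack-years = packs smoked per day x years of smoking."""
    return packs_per_day * years_smoked

def ldct_eligible(age, asymptomatic, current_smoker,
                  years_since_quit, packs_per_day, years_smoked):
    """Apply the coverage criteria for an annual LDCT screening."""
    if not (55 <= age <= 77):
        return False
    if not asymptomatic:
        return False
    # Must be a current smoker, or have quit within the last 15 years.
    if not (current_smoker or years_since_quit <= 15):
        return False
    # At least a 30 pack-year smoking history.
    return pack_years(packs_per_day, years_smoked) >= 30

# A 60-year-old asymptomatic ex-smoker who quit 5 years ago, after
# smoking 1.5 packs a day for 25 years (37.5 pack-years), qualifies.
print(ldct_eligible(60, True, False, 5, 1.5, 25))  # True
```

The difficulty providers face is not the rule itself but the inputs: age and diagnoses are structured, while smoking status and pack-year history are often buried in free-text clinical notes, which is exactly where NLP comes in.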

CMS added the coverage after determining there was sufficient evidence that LDCT procedures were cost-effective for high-risk populations. The National Lung Screening Trial, for example, found that 12,000 deaths a year could be avoided if high-risk patients underwent an LDCT scan. Lung cancer is currently the leading cause of cancer-related death among both men and women in the US.

Linguamatics hosted our Spring Text Mining Conference in Cambridge last week (#LMSpring16). Attendees from the pharmaceutical industry, biotech, healthcare, personal consumer care, crop science, academia, and partner vendor companies came together for hands-on workshops, round table discussions, and of course, some excellent presentations and talks. 

The talks kicked off with a presentation by Thierry Breyette, Novo Nordisk, who described three different projects where text mining provided significant value from real-world data. Thierry took the RAND Corporation definition: "Real-world data (RWD) is an umbrella term for different types of data that are not collected in conventional randomised controlled trials. RWD comes from various sources and includes patient data, data from clinicians, hospital data, data from payers and social data."

At Novo Nordisk they have gained business impact by text mining a variety of sources, including: social media, to find digital opinion leaders; conversation transcripts between medical liaisons and healthcare professionals, for trends around clinical insights; and patient and caregiver ethnographic data, to see patterns in patient sentiment and compliance.