In healthcare, the excitement about the potential for big data and machine learning is palpable, and there is more accessible electronic information than ever before.

The challenge for the healthcare community is that approximately 80% of the data in a typical electronic health record (EHR) is trapped within unstructured notes, which require expensive human annotation before machine learning systems can use them.

So what’s the solution? Natural language processing (NLP), another artificial intelligence (AI) technique, can turn this unstructured text into a set of features for machine learning to use. Data-driven, rule-based NLP techniques can extract information from text using linguistic patterns and terminologies with high precision and recall, avoiding the need to manually annotate training data for the machine learning model.

Read the full PM360 article to find out more about how the combination of NLP and machine learning can be a powerful tool for developing predictive models in healthcare and life science.


About David Milward:

David Milward is Chief Technology Officer at Linguamatics. He is a pioneer of interactive text mining, and a founder of Linguamatics. He has over 20 years of experience in natural language processing (NLP) product development, consultancy, and research.


With a background such as mine - medicine/information technology/government/military - you need to know your audience and ensure acronyms are appropriate.

In healthcare alone, DOA can mean several things: degenerative osteoarthritis, date of arrival, drug of abuse, dead on arrival, etc., most of which I REALLY don’t want to see in a healthcare analytical report for Rheumatology.

ETL is no exception. It is now widely used in healthcare to mean “Extract, Transform and Load” and - unless you are speaking to someone in the area of pulmonary and respiratory diseases - it will seldom be confused with “expiratory threshold load”, which helps determine respiratory muscle efficiency. Then there is AMP, which in medicine is most commonly known as adenosine monophosphate, a vital component in all living cells. But for Linguamatics Health users, AMP is an acronym that is vital in its own right and stands for Asynchronous Messaging Pipeline.
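To make the point about audience concrete, here is a minimal Python sketch of context-aware acronym expansion. The expansions come from the examples above, but the context labels (for instance "rheumatology" or "data engineering"), the `EXPANSIONS` table, and the `expand` function are hypothetical names chosen for illustration, not part of any Linguamatics product.

```python
from typing import Optional

# Hypothetical lookup table: acronym -> {context label -> expansion}.
# The context labels are assumptions made for this illustration.
EXPANSIONS = {
    "DOA": {
        "rheumatology": "degenerative osteoarthritis",
        "admissions": "date of arrival",
        "toxicology": "drug of abuse",
        "emergency": "dead on arrival",
    },
    "ETL": {
        "data engineering": "Extract, Transform and Load",
        "pulmonology": "expiratory threshold load",
    },
    "AMP": {
        "biochemistry": "adenosine monophosphate",
        "linguamatics health": "Asynchronous Messaging Pipeline",
    },
}


def expand(acronym: str, context: str) -> Optional[str]:
    """Return the expansion of an acronym for a given audience, or None if unknown."""
    senses = EXPANSIONS.get(acronym.upper(), {})
    return senses.get(context.lower())


print(expand("DOA", "Rheumatology"))  # degenerative osteoarthritis
print(expand("ETL", "pulmonology"))   # expiratory threshold load
print(expand("DOA", "cardiology"))    # None: no sense recorded for this audience
```

Returning `None` for an unrecognised audience, rather than guessing, mirrors the advice above: when you do not know your audience, spell the term out.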

Here at Linguamatics we are grateful to have some very talented folks who can explain our technological world in a way that is (sometimes) less technical. Alex Richard-Hoyling (Senior Solutions Developer) explained how he helps ensure reliable data extraction in large healthcare systems via the Linguamatics Community. Below, I take the subject a step further to cross the chasm of where tech meets med.


How do you ensure your healthcare company outshines the competition with so many choices out there? There’s an app for that! Well, no - not yet. At least there wasn’t at the time I wrote this blog; I double-checked. There is, however, the National Committee for Quality Assurance (although there is no app, they do have a very informative Twitter account).

The committee’s mission is to help continually ensure quality in health from all parties involved. For insurance companies, they use the Healthcare Effectiveness Data and Information Set (HEDIS), as it is “one of the most widely used sets of health care performance measures in the United States” [1]. So rather than trying to compare two things that merely sound similar, such as ‘pineapples to apples’, people now have a true method of payer comparison.

Download the PDF: Case Study on Big Data Analytics for Population Health

HEDIS consists of a set of measures around patient care and service. Measures vary from simple documentation of an adult body mass index (BMI), a calculation involving only height and weight, to the more complicated documentation of comprehensive diabetes care.
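To show how simple the BMI calculation is, here is a minimal Python sketch using the standard formula: weight in kilograms divided by the square of height in metres. The function name and sample values are hypothetical, and the sketch covers only the arithmetic, not the documentation requirements of the HEDIS measure itself.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """Return BMI as weight (kg) divided by height (m) squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


# Illustrative values only: a 70 kg adult who is 1.75 m tall.
print(round(body_mass_index(70.0, 1.75), 1))  # 22.9
```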


Pentavere Research Group of Toronto, Canada, was developing a platform to provide health insights from Real-World Evidence (RWE). Pentavere’s aim is to improve healthcare efficiency by allowing life science companies and healthcare providers to understand the impact of clinical decisions made in the primary care setting.

The company’s proprietary platform, daRWEn™, uses digitized, de-identified, and aggregated health information, but much of the valuable data that it wanted to include was locked inside free-form text, making it difficult to extract. Pentavere soon realized that it needed to incorporate natural language processing (NLP) capabilities into its platform in order to access these RWE insights. To achieve this in a timely and efficient manner, it chose to integrate the Linguamatics I2E NLP solution into daRWEn™.

Why Linguamatics? There were several important factors, including:


There’s a lot of buzz in the healthcare community at the moment surrounding the use of artificial intelligence with machine learning for pattern identification, decision-making, and outcome prediction. The availability of high-quality data for training algorithms is vital to machine learning’s success - but a lot of this information is tied up in unstructured clinical notes. Natural language processing (NLP) is the key to extracting the “good stuff” from this vast trove of unstructured text. Combining that “good stuff” with already structured data helps healthcare providers to understand the patterns and trends in data via machine learning - and thereby enhance care, reduce costs, and improve population health.

Which type of NLP software is best?

The first question that healthcare users must ask themselves is “Which type of NLP software best suits my needs?”

Statistical NLP systems require example data to identify patterns in new data. The examples may come from dictionaries or ontologies - or they might need to be manually annotated by a clinician - which can be an extremely laborious and institutionally costly task.

Meanwhile, most rule-based NLP systems require a specialist to define the types of language rules or patterns that represent certain healthcare concepts. This approach can make them more accurate, but it limits them to the patterns that the specialist has thought of.
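As a rough illustration of the rule-based approach and its limitation, the Python sketch below uses a single hand-written regular expression to pull BMI values out of free text. The pattern, the `extract_bmi` function, and the sample notes are all hypothetical; this is a toy example of the general idea, not how Linguamatics I2E or any other commercial NLP system works.

```python
import re

# One hand-written rule: "BMI" or "body mass index", an optional linking
# word or symbol, then a two-digit number with an optional decimal part.
BMI_PATTERN = re.compile(
    r"\b(?:BMI|body mass index)\b\s*(?:is|of|was|:|=)?\s*(\d{2}(?:\.\d+)?)",
    re.IGNORECASE,
)


def extract_bmi(note: str) -> list[float]:
    """Return every BMI value in the note that the rule recognises."""
    return [float(value) for value in BMI_PATTERN.findall(note)]


note = "Pt seen today. BMI: 31.2. Previous body mass index was 29."
print(extract_bmi(note))                   # [31.2, 29.0]

# A phrasing the rule author did not anticipate is silently missed.
print(extract_bmi("Quetelet index 27.5"))  # []
```

The second call returns an empty list because "Quetelet index", an older name for BMI, is not covered by the rule: the extractor is precise on the phrasings it knows and blind to the ones nobody wrote a pattern for.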