
Bringing Order to the Chaos of Big, Messy, Disparate Data


As big data continues to get bigger, you may have heard data scientists assessing the influx of information through the lens of key characteristics, otherwise known as the “Seven V’s” of big data. These include the data’s Volume, Velocity, Variety, Variability, Veracity, Visualization, and Value. Much has been written about how to define and optimize data through these lenses, and while each is important, I think there is a much simpler way to understand how most of us in the real world of healthcare experience big data: It’s big, meaning the volume increases exponentially on a daily basis; it’s messy, meaning it is largely unstructured with varying degrees of completeness; and it’s disparate, meaning it comes from multiple different systems, which usually don’t communicate with each other.

So far, I’m painting a rather bleak picture, and you might be wondering why you should bother trying to make sense of the confusing spaghetti of data out there. I’m here to tell you that you absolutely should, and with today’s natural language processing (NLP) technology, it is easier than you might think to unlock value from the never-ending deluge of information.

Breaking down the big, the messy, and the disparate

The rate of data growth in general remains exponential: we are creating more data on a day-to-day basis than ever before. On average, a hospital system produces 50 petabytes of data a year. That is a big number, and for those who, perhaps like me, aren't so familiar with the different byte sizes, you can think of one petabyte as 11,000 4K movies, 4,000 digital photos taken every day for the rest of your life, or USB sticks laid out over 92 football pitches. So when we talk about volume and velocity, what we really mean is that today's data streams are enormous and arriving fast, nonstop, every single day.

The next thing to consider is that incoming healthcare data is largely unstructured, particularly within an electronic medical record (EMR). Much of the information in these files is variable, with different frequencies and taxonomies, and it arrives in a broad variety of formats: spreadsheets, databases, sensor readings, texts, photos, audio, video, multimedia files, and more. Most of it does not come from a controlled environment and is therefore potentially incomplete, forcing us to question the veracity of what we receive. So it's safe to say that healthcare data is not only big but also very messy.

And finally, when we add in that a healthcare system has, on average, over 15 different EMR systems, then we can see that not only is the data big and messy, but it's disparate. All this together explains why 97% of healthcare data is never used beyond its original creation.

Clinical NLP - finding information beyond medical codes

Putting all of this together paints a picture of how complex and difficult it is to extract meaning from healthcare data. So why bother? The answer is that unstructured data matters because without it we are operating with major blind spots. To demonstrate, take the example of a simple clinical encounter between a patient and their physician. As standard, a clinician will discuss the presenting complaint, its relevant history, and the patient's past medical history, social history, family history, and medication history, including allergies. This information is all captured either through dictation or typed notes. However, most of this information is not needed for conventional billing purposes and is therefore left behind. If we extend our thinking on data capture beyond what matters for billing to what matters for patient and population health, then the rich information noted by the care team in documents such as a clinic letter or discharge letter should be our most urgent priority. Even if we only scratch the surface of these data, we can uncover valuable information such as medication compliance, smoking status, alcohol consumption, whether the patient lives alone or with family, and disease severity: all prognostic indicators that billing codes can't come close to providing.

The perceived problem is that this information is all written in free-form text, with many potential ways and nuances of saying the same thing. The good news is that clinical natural language processing has been developed to address exactly this challenge, allowing the missing pieces to be filled in and the full picture to be seen, not just of the patient but also of the populations to which they belong.
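To make the idea concrete, here is a toy sketch (not Linguamatics' actual engine, which is far more sophisticated) of how varied free-text phrasings can be normalized to a single structured value. The cue phrases and the `smoking_status` helper are illustrative assumptions; real clinical NLP also handles negation, context, and medical ontologies.

```python
import re

# Hypothetical phrase patterns mapping free-text mentions of smoking
# status onto one structured value. A production system would use
# richer linguistics (negation scope, temporality, ontologies).
SMOKING_PATTERNS = {
    "current": [r"\bcurrent(ly)?\s+smok", r"\bsmokes\b",
                r"\b\d+\s*(cigarettes|cigs)\s*(a|per)\s*day"],
    "former":  [r"\bex[-\s]?smoker\b", r"\bquit\s+smoking\b",
                r"\bformer\s+smoker\b"],
    "never":   [r"\bnever\s+(a\s+)?smok", r"\bnon[-\s]?smoker\b",
                r"\bdenies\s+(tobacco|smoking)"],
}

def smoking_status(note: str) -> str:
    """Return a normalized smoking status for a clinical note snippet."""
    text = note.lower()
    for status, patterns in SMOKING_PATTERNS.items():
        if any(re.search(p, text) for p in patterns):
            return status
    return "unknown"
```

Many surface forms ("ex-smoker", "quit smoking", "former smoker") collapse into one value that downstream analytics can actually query, which is the essence of turning messy text into usable structure.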

For example, we have worked with an academic medical center that knew social isolation tends to drive outcomes like missed appointments in cancer patients. They wanted to identify these patients proactively, so we deployed our NLP against 150,000 clinician documents to identify, with high accuracy, patients who were socially isolated. There is no medical code for isolation, of course, so this had to be extracted from unstructured text buried in clinician notes, and using Linguamatics' best-in-breed NLP, we were able to do so in just eight seconds. By proactively identifying social risk factors within a population, organizations can find patients at risk and establish proactive outreach, like transportation or community support. Applying NLP to existing data to drive preventative care in this way can reduce missed appointments, increase patient engagement, and avoid the expensive hospitalizations that result when patients are not managed appropriately.
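The Linguamatics pipeline itself is proprietary, but the shape of such a population screen can be sketched. In this assumed example, `flag_isolated` scans each (patient, note) pair for isolation cues and returns the patients to prioritize for outreach; the cue list is hypothetical and ignores negation ("does not live alone"), which a real system must handle.

```python
import re
from collections import defaultdict

# Hypothetical free-text cues for social isolation; there is no
# medical code for this, so it must come from clinician notes.
ISOLATION_CUES = [
    r"\blives\s+alone\b",
    r"\bno\s+(family|social)\s+support\b",
    r"\bsocially\s+isolated\b",
    r"\bwidowed?\b.*\balone\b",
]

def flag_isolated(documents):
    """documents: iterable of (patient_id, note_text) pairs.
    Returns {patient_id: [matched cues]} for outreach prioritization."""
    flagged = defaultdict(list)
    for patient_id, note in documents:
        text = note.lower()
        for cue in ISOLATION_CUES:
            if re.search(cue, text):
                flagged[patient_id].append(cue)
    return dict(flagged)
```

Run over a document store, a screen like this converts prose buried in thousands of notes into a simple work list a care coordinator can act on.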

Visualization and value

This is just one of many examples of the hidden value locked away in unstructured data. Today's data may be big, messy, and disparate, but modern NLP solutions excel at transforming the chaos of unstructured text into the final two V's we haven't yet discussed: visualization and value. Linguamatics NLP can rapidly ingest unstructured data from a wide variety of sources and cut through the noise to produce easily comprehensible, actionable results. Whether exported to downstream BI tools or explored interactively in the NLP Insights Hub, our core NLP capability amplifies the relevant signal above the noise to let you see what matters to you.

For providers, payers, and pharma alike, the ability to unlock context in your untapped data has powerful implications. Insights supporting prescribing patterns, risk assessments, and safety netting, as well as the ability to build more powerful predictive models, are all at your fingertips with today's NLP. If you are interested in unlocking the untapped textual data within your practice, please reach out to Linguamatics today.

Learn more about how NLP can be used to rapidly surface features of interest at scale, in an automated, robust, and easily configurable pipeline that delivers comprehensive value across multiple lines of business.

Visit the NLP Data Factory webpage

