
It is well known that the drug discovery and development process is lengthy, expensive and prone to failure. Starting from the selection of a novel target in discovery, through the multiple steps to regulatory approval, the overall probability of success is less than 1%.
One contributing factor is that most diseases are multifaceted, so the challenge lies in identifying the patient populations most likely to respond to a specific intervention. A stratified approach has proven beneficial in a number of cancers and genetic diseases, and pharmaceutical companies have a strong interest in finding the patient sub-populations for which a therapy is most appropriate, both for testing in clinical trials and for broader clinical use.
The ultimate aim of a stratified approach to medicine is to enable healthcare professionals to provide the “right treatment, for the right person, at the right dose, at the right time”, and many research initiatives (governmental, private, public) are ongoing to develop the appropriate knowledge and models.
Such models need suitable data: clean, structured, and granular enough to provide the required level of detail for the disease model of interest. Electronic medical records (EMRs) contain a mix of structured, semi-structured and unstructured data (see figure). While the structured and semi-structured data are relatively straightforward to feed into models, the unstructured text holds much more detail and content, which is highly valuable but difficult to use. Natural Language Processing (NLP) can target these unstructured free-text fields and extract data (e.g. symptoms and disease severity) that would otherwise be unavailable.
EMRs contain a mix of data types: 1. Structured fields (e.g. ATC code, discharge data), which tend to be controlled fields with validated data entry. 2. Semi-structured fields, which are often parameter-value pairs (e.g. dosage, smoking, body temperature). 3. Unstructured free text, where nurses and physicians can enter case notes and discharge notes, with all the patient details that do not fit into structured or semi-structured fields.
Bristol-Myers Squibb use NLP for patient stratification and risk assessment
One successful example of using data from the unstructured text was published by Bristol-Myers Squibb (BMS). BMS wanted to understand more about patient stratification for heart failure risk. Heart failure patients typically exhibit high levels of clinical heterogeneity, which is problematic for treatment and for risk stratification. BMS researchers believed that if they could acquire a deeper understanding of the clinical characteristics of these patients, they could potentially understand how best to treat different patients or populations.
To that end, researchers obtained Electronic Health Records (EHRs) and imaging data for about 900 patients, and used Linguamatics NLP to write queries that extracted and normalized approximately 40 different variables covering patient demographics, clinical outcomes, clinical phenotypes and other measures such as ejection fraction and left ventricular mass.
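To give a feel for what "extract and normalize" means for a single variable, here is a minimal Python sketch that pulls an ejection fraction value out of a free-text note. It is not the Linguamatics query language or the BMS method, just a plain regular expression standing in for a real NLP query. The pattern, the function name `extract_ejection_fraction` and the choice to normalize a range to its midpoint are all assumptions made for illustration.

```python
import re
from typing import Optional

# Hypothetical pattern (an assumption, not a production query): matches
# mentions such as "LVEF 35-40%", "EF: 35 %" or "ejection fraction of 55%".
EF_PATTERN = re.compile(
    r"\b(?:LVEF|EF|ejection\s+fraction)\b"    # the concept, in a few spellings
    r"\D{0,20}?"                              # short non-numeric gap ("is", ":", "of")
    r"(\d{1,2})"                              # first value
    r"(?:\s*(?:-|to)\s*(\d{1,2}))?"           # optional upper bound of a range
    r"\s*%",
    re.IGNORECASE,
)


def extract_ejection_fraction(note: str) -> Optional[float]:
    """Return the ejection fraction as a percentage, or None if not mentioned."""
    match = EF_PATTERN.search(note)
    if match is None:
        return None
    low = float(match.group(1))
    high = float(match.group(2)) if match.group(2) else low
    # Normalization step (an illustrative choice): report a range as its midpoint.
    return (low + high) / 2


print(extract_ejection_fraction("Echo today. LVEF 35-40%, mild LVH."))
print(extract_ejection_fraction("Patient reports fatigue."))
```

A real clinical NLP query also has to handle negation, history versus current findings, and many more phrasings, which is where a dedicated platform goes well beyond a regular expression.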
With advanced statistical clustering, BMS researchers identified four classes of patients with discrete clinical and echocardiographic characteristics. The four groups showed significant differences in one- and two-year mortality and also one-year hospitalizations. By better understanding how to stratify patient populations for heart failure, BMS has unlocked insights that can potentially improve the design of clinical trials, identify unmet needs and develop better therapeutics.
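The clustering method BMS used is not described here, so the sketch below swaps in plain k-means purely to illustrate the idea of grouping patients by their extracted variables. The function names are hypothetical, and the four data points are invented, pre-scaled toy values, not patient data or results from the study.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Point], k: int, iterations: int = 20) -> List[int]:
    """Plain k-means; returns one cluster index per input point."""
    centroids = list(points[:k])  # naive initialization: the first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


# Invented toy features, already scaled to 0-1 (e.g. ejection fraction, LV mass).
toy_patients = [(0.2, 0.9), (0.8, 0.1), (0.25, 0.85), (0.75, 0.2)]
labels = kmeans(toy_patients, k=2)
print(labels)
```

In practice the variables would be scaled consistently, the number of groups chosen with a statistical criterion, and the resulting classes checked against outcomes such as mortality and hospitalization.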
