I recently attended a talk by Linguamatics CTO David Milward on Structured Queries for Unstructured Data, delivered to the Data Insights Cambridge Meetup group.

The data science community wants to know:

  • How can we deliver insights from big data?

  • What are the optimal approaches to ‘handle’ (store, capture) and analyze (query, structure, repurpose) big data?

The amount of data we can generate and store is many times what we could capture or store just 10 years ago. SQL database technology handles structured data well and has not changed significantly since the 1980s. For basic queries, it is easier to deliver insights from structured data than from unstructured data in free-text sources.

Unstructured data is the new frontier for data science

What drew so many people to David’s talk is the promise of the ‘data insights’ that are locked away in unstructured data. The audience spanned various industries, from those dealing with astronomical data and financial data sources to the many people concerned with unstructured data in health and life sciences. Many industries rely heavily on data to inform their day-to-day business decisions. For healthcare and life sciences, where Linguamatics is the text mining leader, transforming how we understand and improve population health and patient outcomes will primarily entail extracting data insights from unstructured data sources.

How ready are you for IDMP?

IDMP (IDentification of Medicinal Products) is a set of international standards developed by ISO that will become mandatory in Europe in a phased approach, effective from 2018, and will also be adopted by the FDA and globally over the next few years. As with any new regulatory change, it is valuable to hear about others' experiences and ideally understand and learn from industry best practice. 

Joining the IRISS Forum is the best way of keeping track of IDMP. I joined IRISS this year; it is an excellent source of up-to-date IDMP information and of valuable input from industry experts (such as Andrew Marr, Vada Perkins, and others).

The IRISS (Implementation of Regulatory Information Submission Standards) Forum was created to address the need for a single central forum for open and broad stakeholder discussion of evolving standards, user requirements and the practical, global implementation issues of these standards, for the mutual benefit of industry, government agencies and, ultimately, public health.

IRISS recently (September 2016) surveyed its members, both pharmaceutical companies and vendors, on their current state of readiness for IDMP compliance. The companies that took part in the industry survey covered a wide range of organizational sizes, from small companies with fewer than 100 EU authorizations to larger ones with more than 5000 (or, from fewer than 10 active ingredients to more than 250). Over 80% of those in the survey had a global reach.


Drug safety and pharmacovigilance are critical aspects of drug development. To understand and monitor potential risks for pharmaceuticals, researchers use many different strategies to uncover evidence of real-world reports of adverse events and patient-reported outcomes.

At the upcoming Linguamatics Text Mining Summit, there are three talks on text mining strategies that improve our understanding of drug-related adverse reactions.

Nina Mian from AstraZeneca will present research on text mining adverse event data from both FDA drug labels (derived from clinical trial data) and real-world data from PatientsLikeMe. Eric Lewis from GSK will discuss applications of I2E for clinical safety and pharmacovigilance, particularly the problems of identifying potential “new signals” and distinguishing signal from noise. And Stuart Murray from Agios will present workflows for automated identification of potential drug safety events.

These talks, from industry specialists, demonstrate the value of text mining to access and understand the complex world of drug safety and safety signals.

Life sciences researchers who employ text mining and NLP (natural language processing) techniques to extract discrete facts and insights from scientific articles have generally relied on MEDLINE abstracts to define a corpus.

But there is increasing interest in mining full-text articles, with researchers experimenting on corpora sourced from Open Access repositories like PubMed Central (PMC). Organizations are eager to take advantage of the unique benefits full text provides, and rightly so.

Full-text content provides insights that researchers would not be able to access using abstracts alone. Here are three central benefits of mining a full-text corpus:

Volume. Full-text articles include more named entities and relationships between those entities than their corresponding abstracts – this is intuitively obvious when we consider the length of an abstract versus its full-text article. A study published in the Journal of Biomedical Informatics makes this point quantitatively: Only 7.84% of the scientific claims made in full-text articles are found in their abstracts.[i]

Wilmington, DE – The Pistoia Alliance, an organization dedicated to improving global life sciences R&D, has seen its membership increase with a number of new members including both large multinationals and start-ups.

The new members include Accenture, Linguamatics, Novaseek, Repositive, Agrimetrics and Daniel Taylor. Existing members upgrading to the new Startup membership category this quarter include KNIME, Scitegrity, Databiology, The Hyve, Binocular Vision, BioVariance, and Promeditec. This takes the membership of the Pistoia Alliance to over 80 globally, which includes many of the world’s biggest pharmaceutical companies, many of the most innovative start-ups, and companies and organizations that support the life sciences sector.

Dr. Steve Arlington, Pistoia Alliance President, said: “I am delighted to see that the Pistoia Alliance continues to show strong growth across all segments in life sciences, and we continue to attract a broad range of members. At the same time, we are also changing how we operate. Our challenge is to promote and encourage pre-competitive collaboration between our members, to benefit our members and ultimately accelerate the delivery of new drugs, devices and services to enhance performance within the sector. The Pistoia Alliance is well placed to help life sciences tackle many of its challenges and through our new strategy and the continued support of our members we will continue to support the global life sciences industry.”