Veracity - Can Text Analytics help solve Uncertainty of Data?

July 28 2015

Giving a presentation on NLP text mining a couple of weeks ago*, I was asked whether our text analytics solution can help with one of the extra Vs of big data – Veracity. This is a much-discussed topic at the moment and, after Volume, Velocity and Variety, seems to be the most important of the additional Vs (see Seth Grimes' blog for a good discussion on some more “wanna-Vs”).

Veracity, when it comes to data and decision making, can mean many things:

  • Does my conclusion make sense?
  • Is this particular data point accurate?
  • Do I trust this publication?
  • Is this assertion evidenced reliably?

 - but the bottom line is, if I am making an important business decision, how can I be sure it’s made using the best possible data?

This is obviously a tricky question, and it has been thrown into public view over recent years by studies that tried to replicate critical experimental data and found reproducibility frighteningly low (e.g. in PLoS Biology). So, how can a text analytics tool shed any light in such a minefield?

Scientists in the United States spend $28 billion each year on basic biomedical research that cannot be repeated successfully. That is the conclusion of a study published on 9 June 2015 in PLoS Biology that attempts to quantify the causes, and costs, of irreproducibility.

Well, if business rules can be defined that provide pointers to quality research, then Linguamatics' text analytics solution, I2E, can pull these out. So, for example, you might trust papers from particular journals, or authors, or institutions. Or look at the inter-relatedness of the authors and co-authors. Or experimental assay details. Has the protocol been used by others? Does the protocol, or assertion, appear in patents? Or, is a particular fact evidenced more than once? Many times but in the same data source? Many times across different data sources?

None of these “rules” will guarantee the accuracy of your fact, but they all provide weight towards a measure of reliability. And, other than going back to the bench yourself, it’s tricky to know how else to make such judgements. 
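
As a rough illustration of how a few of these rules might be combined into a single weight, here is a minimal Python sketch. It is not I2E code and does not use any Linguamatics API: the `Evidence` record, the `TRUSTED_JOURNALS` set, the three rules chosen and their equal weighting are all hypothetical assumptions made for the example, and it assumes the facts have already been extracted from the text.

```python
from dataclasses import dataclass

# Hypothetical list of journals you have decided to trust (an assumption
# for this sketch, not a recommendation or a real configuration).
TRUSTED_JOURNALS = {"Journal A", "Journal B"}


@dataclass(frozen=True)
class Evidence:
    """One extracted occurrence of a fact (hypothetical record layout)."""
    fact: str      # normalised assertion, e.g. "compound X inhibits target Y"
    journal: str   # where the assertion was published
    source: str    # data source it was found in, e.g. "MEDLINE" or "patents"


def reliability_weight(fact, evidence):
    """Return a weight between 0 and 1 from three simple, equally weighted rules."""
    hits = [e for e in evidence if e.fact == fact]
    if not hits:
        return 0.0
    score = 0
    # Rule 1: at least one occurrence comes from a trusted journal.
    if any(e.journal in TRUSTED_JOURNALS for e in hits):
        score += 1
    # Rule 2: the fact is evidenced more than once.
    if len(hits) > 1:
        score += 1
    # Rule 3: the fact is evidenced across different data sources.
    if len({e.source for e in hits}) > 1:
        score += 1
    return score / 3
```

A fact reported twice, once in a trusted journal and once in a patent, would score 1.0 under these rules, while a single report in an unlisted journal would score 0.0. In practice you would pick and weight the rules to suit the decision being made, and the result remains a pointer towards reliability rather than a guarantee.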

* NB meeting was: EBI-EMBL Industry Workshop: Data enhancement through scientific literature workshop, June 17th 2015.