Text mining (also referred to as text analytics) is an artificial intelligence (AI) technology that uses natural language processing (NLP) to transform the free (unstructured) text in documents and databases into normalized, structured data suitable for analysis or to drive machine learning (ML) algorithms.
This section of our website provides an introduction to these technologies, and highlights some of the features that contribute to an effective solution. A brief (90-second) video on natural language processing and text mining is also provided below.
Widely used in knowledge-driven organizations, text mining is the process of examining large collections of documents to discover new information or help answer specific research questions.
Text mining identifies facts, relationships and assertions that would otherwise remain buried in the mass of textual big data. Once extracted, this information is converted into a structured form that can be further analyzed, or presented directly using clustered HTML tables, mind maps, charts, etc. Text mining employs a variety of methodologies to process the text, one of the most important of these being Natural Language Processing (NLP).
The structured data created by text mining can be integrated into databases, data warehouses or business intelligence dashboards and used for descriptive, prescriptive or predictive analytics.
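To make this concrete, the sketch below shows one common way free text can be turned into structured records. It uses the open-source spaCy library (chosen for illustration, not the Linguamatics platform) and its small English model, which must be installed separately:

```python
# A minimal sketch of converting free text into structured rows via
# named-entity recognition. Assumes: pip install spacy, then
# python -m spacy download en_core_web_sm. Illustrative only; not the
# Linguamatics pipeline.
import spacy

nlp = spacy.load("en_core_web_sm")

text = (
    "Acetaminophen was approved by the FDA in 1951. "
    "Johnson & Johnson markets it as Tylenol in the United States."
)

doc = nlp(text)

# Each recognized entity becomes a structured row that could be loaded
# into a database, data warehouse or BI dashboard.
rows = [
    {"text": ent.text, "label": ent.label_,
     "start": ent.start_char, "end": ent.end_char}
    for ent in doc.ents
]

for row in rows:
    print(row)
```

A full text mining pipeline layers normalization, relationship extraction and domain ontologies on top of this kind of entity recognition, but the input/output shape is the same: unstructured text in, analyzable rows out.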
Natural Language Understanding helps machines “read” text (or another input such as speech) by simulating the human ability to understand a natural language such as English, Spanish or Chinese. Natural Language Processing includes both Natural Language Understanding and Natural Language Generation, which simulates the human ability to create natural language text, e.g. to summarize information or take part in a dialogue.
As a technology, natural language processing has come of age over the past ten years, with products such as Siri, Alexa and Google's voice search employing NLP to understand and respond to user requests. Sophisticated text mining applications have also been developed in fields as diverse as medical research, risk management, customer care, insurance (fraud detection) and contextual advertising.
Today’s natural language processing systems can analyze unlimited amounts of text-based data without fatigue and in a consistent, unbiased manner. They can understand concepts within complex contexts and decipher the ambiguities of language to extract key facts and relationships, or provide summaries. Given the huge quantity of unstructured data produced every day, from electronic health records (EHRs) to social media posts, this form of automation has become critical to analyzing text-based data efficiently.
Machine learning is an artificial intelligence (AI) technology that gives systems the ability to learn automatically from experience without explicit programming, and it can help solve complex problems with accuracy that rivals, and sometimes surpasses, that of humans.
However, machine learning requires well-curated input to train from, and this is typically not available from sources such as electronic health records (EHRs) or scientific literature where most of the data is unstructured text.
When applied to EHRs, clinical trial records or full text literature, natural language processing can extract the clean, structured data needed to drive the advanced predictive models used in machine learning, thereby reducing the need for expensive, manual annotation of training data.
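As a hedged illustration of this hand-off, the sketch below feeds NLP-style structured features into a simple scikit-learn model; the feature names, values and labels are invented for the example and do not come from any real EHR or clinical dataset:

```python
# Illustrative only: NLP-extracted structured features driving a
# predictive model. All data below is made up for the sketch.
from sklearn.linear_model import LogisticRegression

# Each row holds features an NLP pipeline might extract from a clinical
# note, e.g. [mentions_smoking, mentions_hypertension, patient_age].
X = [
    [1, 1, 67],
    [0, 1, 54],
    [1, 0, 71],
    [0, 0, 39],
]
# Hypothetical outcome labels (1 = high risk, 0 = low risk).
y = [1, 0, 1, 0]

model = LogisticRegression().fit(X, y)

# Score a new record once NLP has extracted its features.
print(model.predict([[1, 1, 60]]))
```

The point is the division of labor: NLP converts free text into a clean feature matrix, and the machine learning model trains on that matrix instead of on expensive, manually annotated text.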
In this 15-minute presentation, David Milward, CTO of Linguamatics, discusses AI in general, AI technologies such as natural language processing and machine learning, and how NLP and machine learning can be combined to create different learning systems.
While traditional search engines like Google now offer refinements such as synonyms, auto-completion and semantic search (history and context), the vast majority of search results only point to the location of documents, leaving searchers to spend hours manually extracting the data they need by reading through individual documents.
The limitations of traditional search are compounded by the growth in big data over the past decade, which has helped increase the number of results returned for a single query by a search engine like Google from tens of thousands to hundreds of millions.
The healthcare and biomedical sectors are no exception. A December 2018 study by the International Data Corporation (IDC) found that the volume of big data is projected to grow faster in healthcare than in manufacturing, financial services or media over the next seven years, with a compound annual growth rate (CAGR) of 36%.
With the growth of textual big data, the use of AI technologies such as natural language processing and machine learning becomes even more imperative.
Ontologies, vocabularies and custom dictionaries are powerful tools to assist with search, data extraction and data integration. They are a key component of many text mining tools, and provide lists of key concepts, with names and synonyms often arranged in a hierarchy.
Search engines, text analytics tools and natural language processing solutions become even more powerful when deployed with domain-specific ontologies. Ontologies enable the real meaning of the text to be understood, even when it is expressed in different ways (e.g. Tylenol vs. Acetaminophen). NLP techniques extend the power of ontologies, for example by allowing the matching of terms with different spellings (Estrogen or Oestrogen), and by taking context into account (“SCT” can refer to the gene secretin or to the stair climbing test).
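A minimal sketch of how synonym matching and context-based disambiguation might look in code, assuming a small hand-built dictionary in place of a curated ontology:

```python
# Toy ontology lookup: maps surface terms (including spelling variants
# and brand names) to canonical concepts. Hand-built for illustration;
# a production system would load a curated ontology instead.
SYNONYMS = {
    "tylenol": "acetaminophen",
    "paracetamol": "acetaminophen",
    "oestrogen": "estrogen",
    "estrogen": "estrogen",
}

# Ambiguous abbreviations resolved with a crude keyword context check.
AMBIGUOUS = {
    "sct": {"gene": "secretin", "other": "stair climbing test"},
}

def normalize(term: str, context: str = "") -> str:
    """Map a surface term to its canonical concept, using context
    to resolve ambiguous abbreviations."""
    key = term.lower()
    if key in AMBIGUOUS:
        sense = "gene" if "gene" in context.lower() else "other"
        return AMBIGUOUS[key][sense]
    return SYNONYMS.get(key, key)

print(normalize("Tylenol"))                              # acetaminophen
print(normalize("Oestrogen"))                            # estrogen
print(normalize("SCT", context="the SCT gene"))          # secretin
print(normalize("SCT", context="patients took an SCT"))  # stair climbing test
```

Real NLP systems use far richer context models than a keyword check, but the principle is the same: the ontology supplies the concepts and synonyms, and the language processing decides which concept a given mention refers to.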
The specification of an ontology includes a vocabulary of terms and formal constraints on its use. Enterprise-ready natural language processing requires a range of vocabularies, ontologies and related strategies to identify concepts in their correct context:
Linguamatics provides a number of standard terminologies, ontologies and vocabularies as part of its natural language processing platform. More information can be found on our Ontologies page.
The use of advanced analytics represents a real opportunity within the pharmaceutical and healthcare industries, where the challenge lies in selecting the appropriate solution, and then implementing it efficiently across the enterprise.
Effective natural language processing requires a number of features that should be incorporated into any enterprise-level NLP solution, and some of these are described below.
There is huge variety in document composition and textual context, including sources, format, language and grammar. Tackling this variety requires a range of methodologies:
An open architecture that allows for the integration of different components is now a crucial aspect in the development of enterprise systems, and there are a number of key standards in this area:
Partnerships are a critical enabler for industry innovators to access the tools and technologies needed to transform data across the enterprise.
Linguamatics partners and collaborates with numerous companies, academic institutions and governmental organizations to bring customers the right technology for their needs and to develop next-generation solutions. Visit our Partners and Affiliations page for more on our technology and content partnerships.
An effective user interface broadens access to natural language processing tools, rather than requiring specialist skills to use them (e.g. programming expertise, command line access, scripting).
A productive NLP solution provides a range of ways to access the platform to accommodate the business needs and skill sets across the organization, such as:
Text-mining challenges vary in size, from occasional access to a few documents to federated searches over multiple silos and millions of documents. A modern natural language processing solution must therefore:
For more information on selecting the right tools for your business needs, please read our guide on Choosing the right NLP Solution for your Business.
To learn more about the Linguamatics NLP platform, visit our products section.