Text mining (also referred to as text analytics) is an artificial intelligence (AI) technology that uses natural language processing (NLP) to transform the free (unstructured) text in documents and databases into normalized, structured data suitable for analysis or to drive machine learning (ML) algorithms.
This section of our website provides an introduction to these technologies, and highlights some of the features that contribute to an effective solution. A brief (90-second) video on natural language processing and text mining is also provided below.
What is Text Mining?
Widely used in knowledge-driven organizations, text mining is the process of examining large collections of documents to discover new information or help answer specific research questions.
Text mining identifies facts, relationships and assertions that would otherwise remain buried in the mass of textual big data. Once extracted, this information is converted into a structured form that can be further analyzed, or presented directly using clustered HTML tables, mind maps, charts, etc. Text mining employs a variety of methodologies to process the text, one of the most important of these being Natural Language Processing (NLP).
The structured data created by text mining can be integrated into databases, data warehouses or business intelligence dashboards and used for descriptive, prescriptive or predictive analytics.
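As a rough illustration of turning free text into structured rows, the sketch below uses a toy pattern to pull simple subject-relation-object statements out of sentences and writes them as CSV. The pattern, the function names (extract_rows, rows_to_csv) and the sample documents are assumptions made for this example only; real text mining relies on far richer linguistic analysis than a single regular expression.

```python
import csv
import io
import re

# Toy pattern (an assumption for this sketch): "<subject> treats|causes|prevents <object>".
RELATION = re.compile(r"\b(\w+) (treats|causes|prevents) (\w+)\b", re.IGNORECASE)

FIELDS = ["doc_id", "subject", "relation", "object"]


def extract_rows(documents):
    """Turn a {doc_id: text} mapping into a list of structured rows."""
    rows = []
    for doc_id, text in documents.items():
        for subject, relation, obj in RELATION.findall(text):
            rows.append({
                "doc_id": doc_id,
                "subject": subject.lower(),
                "relation": relation.lower(),
                "object": obj.lower(),
            })
    return rows


def rows_to_csv(rows):
    """Serialize rows as CSV text, ready to load into a database or dashboard."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    docs = {
        "d1": "Aspirin prevents stroke. Smoking causes cancer.",
        "d2": "Nothing relevant here.",
    }
    print(rows_to_csv(extract_rows(docs)))
```

Once the facts are in rows like these, they can be counted, joined or charted with ordinary analytics tools.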
What is Natural Language Processing (NLP)?
Natural Language Understanding helps machines “read” text (or another input such as speech) by simulating the human ability to understand a natural language such as English, Spanish or Chinese. Natural Language Processing includes both Natural Language Understanding and Natural Language Generation, which simulates the human ability to create natural language text, e.g. to summarize information or take part in a dialogue.
As a technology, natural language processing has come of age over the past ten years, with products such as Siri, Alexa and Google's voice search employing NLP to understand and respond to user requests. Sophisticated text mining applications have also been developed in fields as diverse as medical research, risk management, customer care, insurance (fraud detection) and contextual advertising.
Today’s natural language processing systems can analyze vast amounts of text-based data without fatigue and in a consistent, unbiased manner. They can understand concepts within complex contexts, and decipher ambiguities of language to extract key facts and relationships, or provide summaries. Given the huge quantity of unstructured data that is produced every day, from electronic health records (EHRs) to social media posts, this form of automation has become critical to analyzing text-based data efficiently.
Machine Learning and Natural Language Processing
Machine learning is an artificial intelligence (AI) technology that gives systems the ability to learn automatically from experience without the need for explicit programming, and can help solve complex problems with accuracy that can rival, and sometimes surpass, that of humans.
However, machine learning requires well-curated input to train from, and this is typically not available from sources such as electronic health records (EHRs) or scientific literature where most of the data is unstructured text.
When applied to EHRs, clinical trial records or full text literature, natural language processing can extract the clean, structured data needed to drive the advanced predictive models used in machine learning, thereby reducing the need for expensive, manual annotation of training data.
In this 15-minute presentation, David Milward, CTO of Linguamatics, discusses AI in general, AI technologies such as natural language processing and machine learning and how NLP and machine learning can be combined to create different learning systems.
Big Data and the Limitations of Keyword Search
While traditional search engines like Google now offer refinements such as synonyms, auto-completion and semantic search (history and context), the vast majority of search results only point to the location of documents, leaving searchers with the problem of having to spend hours manually extracting the necessary data by reading through individual documents.
The limitations of traditional search are compounded by the growth in big data over the past decade, which has helped increase the number of results returned for a single query by a search engine like Google from tens of thousands to hundreds of millions.
The healthcare and biomedical sectors are no exception. A December 2018 study by the International Data Corporation (IDC) found that the volume of big data is projected to grow faster in healthcare than in manufacturing, financial services or media over the next seven years, with a compound annual growth rate (CAGR) of 36%.
IDC White Paper: The Digitization of the World from Edge to Core.
With the growth of textual big data, the use of AI technologies such as natural language processing and machine learning becomes even more imperative.
Ontologies, Vocabularies and Custom Dictionaries
Ontologies, vocabularies and custom dictionaries are powerful tools to assist with search, data extraction and data integration. They are a key component of many text mining tools, and provide lists of key concepts, with names and synonyms often arranged in a hierarchy.
Search engines, text analytics tools and natural language processing solutions become even more powerful when deployed with domain-specific ontologies. Ontologies enable the real meaning of the text to be understood, even when it is expressed in different ways (e.g. Tylenol vs. Acetaminophen). NLP techniques extend the power of ontologies, for example by allowing matching of terms with different spellings (Estrogen or Oestrogen), and by taking context into account (“SCT” can refer to the gene, “Secretin”, or to “Stair Climbing Test”).
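The ideas in the paragraph above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular product works: the tiny ontology, the context cue words for “SCT” and the function names are all hypothetical, and a real system would draw on curated terminologies and much richer context models.

```python
import re

# Hypothetical mini-ontology: preferred concept name -> known names and spellings.
ONTOLOGY = {
    "acetaminophen": ["acetaminophen", "tylenol", "paracetamol"],
    "estrogen": ["estrogen", "oestrogen"],
}

# Invert the ontology so each surface form points to its preferred concept.
SYNONYM_TO_CONCEPT = {
    synonym: concept
    for concept, synonyms in ONTOLOGY.items()
    for synonym in synonyms
}

# Hypothetical context cues for the ambiguous abbreviation "SCT".
SCT_SENSES = {
    "secretin (gene)": {"gene", "expression", "protein", "hormone"},
    "stair climbing test": {"patient", "seconds", "mobility", "completed"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def normalize_concepts(text):
    """Return the sorted list of ontology concepts mentioned in the text."""
    return sorted({SYNONYM_TO_CONCEPT[t] for t in tokenize(text) if t in SYNONYM_TO_CONCEPT})


def disambiguate_sct(text):
    """Pick the SCT sense whose cue words overlap most with the text, or None if no cue matches."""
    tokens = set(tokenize(text))
    scores = {sense: len(cues & tokens) for sense, cues in SCT_SENSES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(normalize_concepts("Patient took Tylenol; oestrogen levels were normal."))
    print(disambiguate_sct("The patient completed the SCT in 12 seconds"))
```

Mapping every surface form to one preferred concept is what lets a search for Acetaminophen also find documents that only mention Tylenol.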
The specification of an ontology includes a vocabulary of terms and formal constraints on its use. Enterprise-ready natural language processing requires a range of vocabularies, ontologies and related strategies to identify concepts in their correct context:
- Thesauri, vocabularies, taxonomies and ontologies for concepts with known terms;
- Pattern-based approaches for categories such as measurements, mutations and chemical names that can include novel (unseen) terms;
- Domain-specific, rule-based concept identification, annotation and transformation;
- Integration of customer vocabularies to enable bespoke annotation;
- Advanced search to enable the identification of data ranges for dates, numerical values, area, concentration, percentage, duration, length and weight.
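The last item in the list above, searching by numerical range, can be illustrated with a small sketch. It assumes percentages written as a number followed by “%” and a hypothetical helper name, sentences_with_percent_in_range; dates, concentrations and other units would each need their own patterns and normalization.

```python
import re

# Assumed surface form for this sketch: a number followed by "%", e.g. "7.5%" or "42 %".
PERCENT = re.compile(r"(\d+(?:\.\d+)?)\s*%")


def sentences_with_percent_in_range(sentences, low, high):
    """Return (sentence, matching_values) pairs where a percentage lies within [low, high]."""
    hits = []
    for sentence in sentences:
        values = [float(v) for v in PERCENT.findall(sentence)]
        in_range = [v for v in values if low <= v <= high]
        if in_range:
            hits.append((sentence, in_range))
    return hits


if __name__ == "__main__":
    corpus = [
        "Response rate was 42% in arm A.",
        "Only 7.5% of patients relapsed.",
        "No percentage here.",
    ]
    for sentence, values in sentences_with_percent_in_range(corpus, 5, 10):
        print(values, sentence)
```

The point of the sketch is that the value is parsed into a number before comparison, so the query is a range over numbers and not a match on strings.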
Linguamatics provides a number of standard terminologies, ontologies and vocabularies as part of its natural language processing platform. More information can be found on our Ontologies page.
Enterprise-Level Natural Language Processing
The use of advanced analytics represents a real opportunity within the pharmaceutical and healthcare industries, where the challenge lies in selecting the appropriate solution, and then implementing it efficiently across the enterprise.
Effective natural language processing requires a number of features that should be incorporated into any enterprise-level NLP solution, and some of these are described below.
There is huge variety in document composition and textual context, including sources, format, language and grammar. Tackling this variety requires a range of methodologies:
- Transformation of internal and external document formats (e.g. HTML, Word, PowerPoint, Excel, PDF text, PDF image) into a standardized searchable format;
- The ability to identify, tag and search in specific document sections (areas), for example restricting a search to exclude the noise from a paper’s reference section;
- Linguistic processing to identify the meaningful units within text such as sentences, noun and verb groups together with the relationships between them;
- Semantic tools that identify concepts within the text such as drugs and diseases, and normalize to concepts from standard ontologies. In addition to core life science and healthcare ontologies such as MedDRA and MeSH, many organizations also require the ability to add their own dictionaries;
- Pattern recognition to discover and identify categories of information that are not easily defined with a dictionary approach. These include dates, numerical information, biomedical terms (e.g. concentration, volume, dosage, energy) and gene/protein mutations;
- The ability to process embedded tables within the text, whether formatted using HTML or XML, or as free text.
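To make the pattern recognition item above more concrete, here is a minimal sketch that uses regular expressions to pick out dosages and protein substitutions that no fixed dictionary could list in advance. The patterns, the unit list and the function names are assumptions for illustration; production patterns need to cover many more notations and edge cases.

```python
import re

# Dose-like measurement: a number followed by a unit, e.g. "500 mg" or "2.5 mL".
# The unit list is an assumption for this sketch and is far from complete.
DOSAGE = re.compile(r"\b(\d+(?:\.\d+)?)\s*(mg|g|mcg|ml|l)\b", re.IGNORECASE)

# Protein substitution in one-letter amino acid notation, e.g. "V600E".
AMINO = "ACDEFGHIKLMNPQRSTVWY"
MUTATION = re.compile(rf"\b([{AMINO}])(\d+)([{AMINO}])\b")


def extract_dosages(text):
    """Return (value, unit) pairs with the value as a float and the unit lowercased."""
    return [(float(value), unit.lower()) for value, unit in DOSAGE.findall(text)]


def extract_mutations(text):
    """Return substitutions as dicts with reference residue, position and new residue."""
    return [
        {"ref": ref, "position": int(pos), "alt": alt}
        for ref, pos, alt in MUTATION.findall(text)
    ]


if __name__ == "__main__":
    sample = (
        "The BRAF V600E mutation was found; the patient received "
        "500 mg twice daily and 2.5 mL of saline."
    )
    print(extract_dosages(sample))
    print(extract_mutations(sample))
```

Because the patterns describe the shape of a term and not its exact spelling, they also match novel values that have never been seen before.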
An open architecture that allows for the integration of different components is now a crucial aspect in the development of enterprise systems, and there are a number of key standards in this area:
- A RESTful Web Services API supports integration with document processing workflows;
- A declarative query language that is human readable and accessible for all NLP functionality (e.g. queries, search terms, context and display settings);
- The ability to transform and integrate extracted data into a common infrastructure for master data management (MDM) or distributed processing with e.g. Hadoop.
Partnerships are a critical enabler for industry innovators to access the tools and technologies needed to transform data across the enterprise.
Linguamatics partners and collaborates with numerous companies and with academic and governmental organizations to bring customers the right technology for their needs and to develop next-generation solutions. Visit our Partners and Affiliations page for more on our technology and content partnerships.
An effective user interface broadens access to natural language processing tools, rather than requiring specialist skills to use them (e.g. programming expertise, command line access, scripting).
A productive NLP solution provides a range of ways to access the platform to accommodate the business needs and skill sets across the organization, such as:
- An intuitive graphical user interface (GUI) that avoids the need for users to write scripts;
- Web portals that enable access by non-technical users;
- An interface to search and browse ontologies;
- An administration interface to control access to data, and allow indexes to be processed on behalf of many users;
- A broad range of out-of-the-box query modules, enabling domain experts to ask questions without the need to understand the underlying linguistics.
Text-mining challenges vary in size, from occasional access to a few documents to federated searches over multiple silos and millions of documents. A modern natural language processing solution must therefore:
- Provide the ability to run sophisticated queries over tens of millions of documents, each of which may be thousands of pages long;
- Handle vocabularies and ontologies containing millions of terms;
- Run on parallel architectures, whether standard multi-core, cluster or cloud;
- Provide a connector to run natural language processing in service-oriented environments such as ETL (Extract, Transform, Load), semantic enrichment and signal detection, for example clinical risk monitoring in healthcare.
For more information on selecting the right tools for your business needs, please read our guide on Choosing the right NLP Solution for your Business.
To learn more about the Linguamatics NLP platform, visit our products section.