Natural Language Processing (NLP) - a technical deep dive

  1. Configurable NLP pipeline
  2. Relationship Extraction
  3. NLP Fundamentals
  4. XML Parsing
  5. Tokenization
  6. Part-of-Speech Tagging
  7. Chunking
  8. Sentence Splitting
  9. Concept Matching
  10. Stemming
  11. An Open Pipeline for Effective Access to Key Components
  12. Applying NLP at scale

Any data scientist looking to unlock more value from their data will turn their attention to free text (where it is estimated 80% of data resides). In order to use conventional tools, methodologies and skills to analyze this data, information buried in the free text needs to be unlocked and converted to a structured format.

The Natural Language Processing (NLP) Platform that underpins all Linguamatics’ products provides an evolving set of components which are scalable, robust and evaluated both for accuracy and performance. At the core of the platform is the NLP used to enrich every piece of text in multiple ways (Figure 1). Our NLP Platform provides an interface between the specialist work of data scientists on extracting value from text, and the clinical stakeholders who need to interpret and validate the results of this work.

Figure 1 Natural Language Processing finds information however it is expressed

Configurable NLP pipeline

While most users choose to run our NLP pipeline (Figure 2) with minimal configuration using out-of-the-box components, it is possible to modify the NLP steps in significant ways, allowing easy experimentation and evaluation of new methods:

  • Appropriate tokenizers, taggers and chunkers can be specified collectively or individually for different languages.
  • Named entity recognition and entity linking can use custom models, e.g., BERT-based models fine-tuned to recognize named entities such as organization and person names.
  • Methods for pre-processing input text and post-processing query output can be customized.

Figure 2 The NLP Connector provides a single interface for using the multiple APIs in the Linguamatics NLP Pipeline

Relationship Extraction

Linguamatics NLP can be used to structure the unstructured, providing JSON, XML or CSV output to feed further processing, e.g., machine learning classifiers. It can also be used for very precise extraction of information, pulling out relationships between concepts in particular contexts ranging from protein-protein interactions to lymph node involvement in cancers. These precise results can feed further processing or be presented directly to end users as web pages, Excel spreadsheets or charts.

To achieve this, the Linguamatics platform provides a declarative query language on top of an index which is created from the linguistic processing pipeline. The index allows for very fast interactive querying of millions of documents. This is invaluable for data-driven interactive development of extraction strategies, as well as providing end-users the ability for interactive, agile text mining, similar to using a search engine.

The declarative queries leverage information that was computed during the indexing phase using a variety of machine learning and rule-based methods, including:

  • Linguistic units such as noun groups and verb groups.
  • Named entities such as diseases at different granularities e.g., “diseases”, “cancers” or “prostate cancers”.
  • Positional information within documents and sentences.
  • Document section location or file meta-data.

In addition, the query engine can identify text defined by substrings, wildcards and regular expressions and process rules that combine all the above with AND, OR, NOT and other operations (Figure 3).
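
The Boolean part of this can be pictured with a toy inverted index. The sketch below is only an illustration of combining terms with AND and NOT over an index; the documents, the `build_index` helper and the word-level indexing are assumptions made for the example and do not reflect the actual Linguamatics index or query language.

```python
from collections import defaultdict

# Hypothetical mini-corpus, keyed by document id.
DOCS = {
    1: "prostate cancer with lymph node involvement",
    2: "breast cancer without lymph node involvement",
    3: "protein interaction study",
}


def build_index(docs):
    """Map each lower-cased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


index = build_index(DOCS)

# cancer AND lymph AND NOT breast, expressed as set operations.
hits = (index["cancer"] & index["lymph"]) - index["breast"]
print(sorted(hits))  # [1]
```

Because the index is built once, each query reduces to a few set operations, which is what makes interactive querying practical.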

Figure 3 Blend of Methods Transform Unstructured Data into Structured Information

NLP Fundamentals

The Linguamatics NLP Platform handles many diverse types of documents, including PDFs and office documents such as Word, Excel and PowerPoint, as well as healthcare-specific documents such as HL7 and CCDA. The simplest case is a plain text file or an XML file. A plain text file is often enriched at the beginning of the process to identify sections or inject additional meta-data into the document, forming an XML file.

XML Parsing

This step is omitted for plain text files. XML documents are parsed using a standard XML library to separate text content from XML tags and attributes while retaining the overall structure of the document. In this way, it is possible to focus extraction on specific parts of a document or to ignore a section of a document, where these sections can be defined via XML tags, attributes, or both.

Every subsequent NLP step looks at the text content, with the system keeping track of where in the document it is located.
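
As a rough illustration of separating text content from tags while still knowing which section the text came from, the following sketch uses Python's standard `xml.etree.ElementTree`. The element name `section`, the `name` attribute and the `section_texts` helper are assumptions made for this example, not the platform's actual schema or API.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: element and attribute names are illustrative only.
SAMPLE = """<document>
  <section name="history">Patient has a history of breast cancer.</section>
  <section name="medications">Currently taking Neoral.</section>
</document>"""


def section_texts(xml_string, include=None, exclude=None):
    """Yield (section_name, text) pairs, optionally filtered by section name."""
    root = ET.fromstring(xml_string)
    for section in root.iter("section"):
        name = section.get("name")
        if include is not None and name not in include:
            continue
        if exclude is not None and name in exclude:
            continue
        # itertext() gathers text from the element and any nested children.
        yield name, "".join(section.itertext()).strip()


# Ignore one section and keep the rest.
for name, text in section_texts(SAMPLE, exclude={"history"}):
    print(name, "->", text)
```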


Tokenization

In this phase, every sentence is split into tokens. Tokens are generally words, but splitting only on white space would leave punctuation attached to strings and compound words stuck together. Our tokenization rules have been developed and refined over many years to strike a robust balance between splitting strings apart and keeping them together.

At this stage, each token is being tracked with its position in its sentence and its position within the document (and position within its section, if it was obtained from an XML file).
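
To make the idea of position-tracked tokens concrete, here is a minimal sketch of a regex tokenizer that records each token's index in its sentence and its character offset in the document. The single rule in `TOKEN_RE` and the `tokenize` helper are assumptions for illustration; the platform's own tokenization rules are far more extensive.

```python
import re

# Illustrative rule: keep decimals and hyphenated compounds together,
# and emit any other non-space character as its own token.
TOKEN_RE = re.compile(r"\d+(?:\.\d+)?|\w+(?:-\w+)*|[^\w\s]")


def tokenize(sentence, sentence_start=0):
    """Return (token, index_in_sentence, document_offset) triples.

    sentence_start is the character offset of the sentence in the document.
    """
    return [
        (m.group(), i, sentence_start + m.start())
        for i, m in enumerate(TOKEN_RE.finditer(sentence))
    ]


for token in tokenize("IL-2 rose to 3.5 mg.", sentence_start=100):
    print(token)
```

Splitting on white space alone would have produced "mg." as one token; here the final period is separated while "IL-2" and "3.5" stay intact.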

Part-of-Speech Tagging

The Linguamatics part-of-speech tagger assigns each token a part of speech (e.g., noun, verb, adjective) based on its context. It uses a machine learning model that has been modified and updated over time to reflect our experience with real-world biomedical data. Each token is assigned a probability, within its context, of being one or more parts of speech: noun, verb, adjective, preposition, etc.


Chunking

Linguamatics groups tokens into chunks (noun groups or verb groups) based on their part of speech. Chunks can be useful to provide extra distance or as linguistic wildcards for data-driven terminology discovery (see noun groups and verb groups in Figure 3).
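
A very small sketch of tag-based chunking follows. It assumes tokens have already been tagged with Penn Treebank-style tags and simply groups consecutive tokens whose tags map to the same chunk type; the `chunk_type` mapping and the `NG`/`VG` labels are assumptions for the example, and a real chunker would use richer rules (this one, for instance, would merge two adjacent noun groups).

```python
from itertools import groupby


def chunk_type(tag):
    """Map a part-of-speech tag to a chunk type, or None if it belongs to no chunk."""
    if tag in {"DT", "JJ"} or tag.startswith("NN"):
        return "NG"  # noun group
    if tag == "MD" or tag.startswith("VB"):
        return "VG"  # verb group
    return None


def chunk(tagged_tokens):
    """Group consecutive (token, tag) pairs that share a chunk type."""
    chunks = []
    for kind, group in groupby(tagged_tokens, key=lambda pair: chunk_type(pair[1])):
        if kind is not None:
            chunks.append((kind, " ".join(token for token, _ in group)))
    return chunks


tagged = [("The", "DT"), ("tumor", "NN"), ("has", "VBZ"), ("invaded", "VBN"),
          ("the", "DT"), ("lymph", "NN"), ("nodes", "NNS"), (".", ".")]
print(chunk(tagged))
```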

Sentence Splitting

This phase involves identifying sentence boundaries and splitting the text on them (Figure 3). It is an intermediate step towards full tokenization (see the Tokenization section) and it allows sentence co-occurrence queries to be performed. These are a particularly useful trade-off between high-recall (and low-precision) document co-occurrence searches and high-precision (and low-recall) fixed-pattern searches. Complex rules have been developed to make sentence splitting as accurate as possible on real-world usage: e.g., abbreviations ending with periods may not always be at the end of sentences, and misplaced decimal points should not needlessly split sentences.
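
The two pitfalls mentioned above can be shown with a deliberately simple splitter. This is a sketch under stated assumptions, not the platform's rule set: the abbreviation list is illustrative, and a boundary is only considered when the punctuation is followed by white space and an uppercase letter, so sentences that begin with a digit or lowercase letter are not split.

```python
import re

ABBREVIATIONS = {"dr.", "e.g.", "i.e.", "vs.", "approx."}  # illustrative list

# Candidate boundary: ., ! or ? followed by white space and an uppercase letter.
BOUNDARY_RE = re.compile(r"[.!?](?=\s+[A-Z])")


def split_sentences(text):
    """Split text into sentences, skipping boundaries after known abbreviations."""
    sentences, start = [], 0
    for match in BOUNDARY_RE.finditer(text):
        end = match.end()
        last_word = text[start:end].split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # e.g. "Dr." is not the end of a sentence
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("Dr. Smith gave 2.5 mg of Neoral. The patient improved."))
```

Neither the period in "2.5" nor the one in a misplaced decimal such as "2. 5" triggers a split, because neither is followed by white space and an uppercase letter.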

Concept Matching

Linguamatics NLP can perform both named entity recognition (NER) and named entity identification (NEI) for concept matching.

The most used method for concept matching is ontology matching.

Ontologies are organized by hierarchy with normalized values, giving users the option to search at different levels of granularity. For example, in Table 1, “breast cancer” and “carcinoma of the breast” describe the same concept; in an ontology, they could be made into synonyms for a node with the normalized value “Breast Neoplasm”. The node “Breast Neoplasm” could be a child node of a parent node describing the concept “Neoplasm” and/or of another parent node describing the concept “Breast Diseases”. “Neoplasm” and “Breast Diseases” could in turn be child nodes of a parent node “Diseases”. Users have the option to choose the node “Breast Neoplasm” for the specific disease, “Neoplasm” if they want to search for any cancer, “Breast Diseases” for any breast-related disease, or “Diseases” if they want to find all diseases mentioned.

Ontologies are defined in Linguamatics in two ways: thesauri and patterns.

The thesaurus approach usually takes a list of terms organized by hierarchy (cyclosporine is an immunosuppressant, so Cyclosporine is a child of Immunosuppressants) and synonymy (Neoral is another name for cyclosporine, so Neoral is a synonym of cyclosporine) to assemble controlled or curated domain vocabularies. In this way, the token “Neoral” found in a document will be annotated as the concept Cyclosporine (and any unique concept identifier associated with the term) as well as annotated as an Immunosuppressant (and any other family terms it belongs to, e.g., Pharmacologic Substance). In this way, terms in the document can be standardized against agreed terminologies, e.g., MedDRA, SNOMED and ICD-10.
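
The combination of synonymy and hierarchy can be sketched with two dictionaries. The structure below is a toy under stated assumptions: the `SYNONYMS` and `PARENTS` tables and the `annotate` helper are hypothetical, and placing Immunosuppressants directly under Pharmacologic Substance is a simplification for the example rather than a statement about any particular terminology.

```python
# Hypothetical mini-thesaurus: lower-cased surface forms map to a concept name.
SYNONYMS = {
    "neoral": "Cyclosporine",
    "cyclosporine": "Cyclosporine",
}

# Hypothetical hierarchy: each concept maps to its parent concept.
PARENTS = {
    "Cyclosporine": "Immunosuppressants",
    "Immunosuppressants": "Pharmacologic Substance",
}


def annotate(token):
    """Return the matched concept followed by all of its ancestors."""
    concept = SYNONYMS.get(token.lower())
    if concept is None:
        return []
    chain = [concept]
    while chain[-1] in PARENTS:
        chain.append(PARENTS[chain[-1]])
    return chain


print(annotate("Neoral"))
```

A query for any node in the returned chain would then find the document that mentions "Neoral".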

The pattern approach is very different and is based on pre-defined regular expressions that scan one or more tokens. Each regular expression looks for something specific: a full date, a mutation, a dosage, a temperature, etc. Once it finds a match, it identifies the key data in the matched text to provide a normalized version of those tokens, e.g., a temperature in degrees Celsius, or a full date in YYYY/MM/DD format.
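
The two normalizations named above can be sketched with Python's `re` module. The patterns here are assumptions chosen for brevity: `TEMP_RE` only recognizes a number followed by C or F, and `DATE_RE` assumes dates are written as MM/DD/YYYY; production patterns would need to cover many more surface forms.

```python
import re

TEMP_RE = re.compile(r"(\d+(?:\.\d+)?)\s*°?\s*([CF])\b")
DATE_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")  # assumes MM/DD/YYYY input


def normalize_temperature(text):
    """Return the first temperature found, in degrees Celsius, or None."""
    match = TEMP_RE.search(text)
    if match is None:
        return None
    value, unit = float(match.group(1)), match.group(2)
    celsius = value if unit == "C" else (value - 32) * 5 / 9
    return round(celsius, 1)


def normalize_date(text):
    """Return the first date found, in YYYY/MM/DD format, or None."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    month, day, year = match.groups()
    return f"{year}/{int(month):02d}/{int(day):02d}"


print(normalize_temperature("Temp 98.6 F on admission"))  # 37.0
print(normalize_date("Seen on 3/7/2021."))  # 2021/03/07
```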

Besides the dictionary-based thesaurus and pattern approaches, it is also possible to plug in external NER models to annotate text and generate ontologies at indexing time through the NER API (Figure 2).

Table 1 Examples of normalizing free text

Ontology matching is highly configurable. Ontology and document comparisons can be based on fuzzy versions of words or on morphological variants of words (see stemming, below); they can take into account possible spelling mistakes or OCR (Optical Character Recognition) errors in words; and they can take into account whether terms should be compared based on glyph or accent differences. Linguamatics employs several methods for synonym expansion, including linguistic expansions and the use of deep learning techniques to discover missing terms.
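
One simple way to picture tolerance for OCR errors and accent differences is approximate string matching on accent-stripped text. The sketch below uses `difflib.get_close_matches` and `unicodedata` from the Python standard library; the term list, the `fuzzy_lookup` helper and the 0.85 cutoff are assumptions for illustration and are not how Linguamatics implements fuzzy matching.

```python
import difflib
import unicodedata

TERMS = ["cyclosporine", "tacrolimus", "sjögren syndrome"]  # illustrative terms


def strip_accents(text):
    """Remove combining marks so that accented and plain spellings compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fuzzy_lookup(word, cutoff=0.85):
    """Return the closest term above the similarity cutoff, or None."""
    index = {strip_accents(term.lower()): term for term in TERMS}
    matches = difflib.get_close_matches(
        strip_accents(word.lower()), list(index), n=1, cutoff=cutoff
    )
    return index[matches[0]] if matches else None


print(fuzzy_lookup("cyclosporlne"))      # OCR-style error: "i" read as "l"
print(fuzzy_lookup("Sjogren syndrome"))  # accent omitted in the document
```

Raising the cutoff trades recall for precision, which is the kind of configuration choice the paragraph above describes.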


Stemming

To allow ontology matches and other search terms to be compared on morphological variants, Linguamatics runs a stemming algorithm on tokens to normalize them (Figure 4). This is not a simple stemming process, so the end results are more like lemmatization: for example, is/are/were/am are normalized to be.
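
The difference between plain suffix stripping and lemma-like normalization can be seen in a toy example. The irregular-form table below contains only the forms mentioned above, and the suffix list and minimum stem length are naive assumptions; this is a sketch of the idea, not the algorithm Linguamatics uses.

```python
# Irregular forms are looked up directly, which suffix stripping alone cannot handle.
IRREGULAR = {"is": "be", "are": "be", "were": "be", "am": "be"}
SUFFIXES = ("ing", "ed", "s")  # naive, illustrative suffix list


def normalize(token):
    """Return a normalized form: an irregular lemma if known, else a stripped stem."""
    word = token.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print([normalize(t) for t in ["Were", "cancers", "treated"]])
```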

Figure 4 Linguistic Processing Using NLP

An Open Pipeline for Effective Access to Key Components

The Linguamatics NLP platform has an open architecture which enables flexible use of the different tools and components.

The NLP pipeline provides call-out points at each step in the process (Figure 2) to communicate with external tokenizers, taggers, chunkers and NER tools. To simplify this process, Linguamatics provides an NLP Connector, which allows users to interact with the pipeline in Python and reduces those various call-outs to a single interface. This makes the call-outs highly configurable, powerful, and modular. As part of the NLP Connector, Linguamatics also provides ready-to-use services to call pre-trained CRF models for tagging different languages. This allows users to extend the NLP capabilities of the system by bringing in external NLP models at all stages of the process.
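
The general design of call-out points behind one interface can be sketched as a pipeline object whose steps are replaceable callables. Everything in this example is hypothetical: the `Pipeline` class, its fields and the default functions are invented for illustration and are not the NLP Connector's actual API.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Tokenizer = Callable[[str], List[str]]
Tagger = Callable[[List[str]], List[Tuple[str, str]]]


def default_tokenizer(text: str) -> List[str]:
    """Placeholder built-in step: split on white space."""
    return text.split()


def default_tagger(tokens: List[str]) -> List[Tuple[str, str]]:
    """Placeholder built-in step: tag every token as unknown."""
    return [(token, "UNK") for token in tokens]


@dataclass
class Pipeline:
    """Hypothetical single interface whose steps can each be swapped out."""
    tokenizer: Tokenizer = default_tokenizer
    tagger: Tagger = default_tagger

    def run(self, text: str) -> List[Tuple[str, str]]:
        return self.tagger(self.tokenizer(text))


# Use the defaults, then replace one step with an external component.
print(Pipeline().run("lymph node"))
print(Pipeline(tokenizer=lambda text: text.split("-")).run("protein-protein"))
```

Keeping each step behind a plain callable means an external model only has to match the step's input and output types to be plugged in.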

Applying NLP at scale

Our technology provides a robust and configurable mechanism for applying NLP at scale. Deployable on premises or in the cloud, it combines database connectors and a configurable webhook framework with an orchestration engine that parallelizes the Natural Language Processing, enabling horizontal scaling for efficient, large-volume NLP. Components in the system are containerized for simpler management. Furthermore, the full system is deployed using Kubernetes or equivalent, allowing for simpler installation, easier service monitoring and automated scaling of the system. The NLP Data Factory can be implemented with our out-of-the-box queries, but also with NLP algorithms that you build with the highly configurable open NLP pipeline described above.
