Scalable Text Mining for Extract Transform Load (ETL) Solutions

Data Transformation: The Challenge
Extracting Unstructured Data from Source Systems

Organizations are embracing the digital revolution, but digital transformation demands data transformation if they are to get the full value from disparate data across the organization. Integrating data from a variety of sources into a data warehouse or other data repository centralizes business-critical data and makes important data faster to find and analyze.

The Extract, Transform, and Load (ETL) process of extracting data from source systems and bringing it into databases or warehouses is well established. While many ETL tools can handle structured data, very few can reliably process unstructured data and documents. It is well known that the majority of data is unstructured:

  • 65–80% of life sciences and patient information is unstructured
  • 35% of research project time is spent in data curation

This means life science and healthcare organizations still face major challenges in fully realizing the value of their data.

Data Transformation: The Solution
Using NLP to Structure Data

Figure: Data transformation with percentage of types

Linguamatics fills this value gap in ETL projects, providing solutions that are specifically designed to address unstructured data extraction and transformation on a large scale. Linguamatics I2E NLP-based text mining software extracts concepts, assertions and relationships from unstructured data and transforms them into structured data to be stored in databases/data warehouses.

Linguamatics automation, powered by I2E AMP, can scale operations up to address big data volume, variety, veracity and velocity. I2E AMP manages multiple I2E servers for indexing and querying, distributing resources, and buffering incoming documents, and is powerful enough to handle millions of records.

Put simply, I2E is a powerful data transformation tool that converts unstructured text in documents into structured facts. Plugging I2E into workflows using I2E AMP (or other workflow tools such as KNIME) enables automation of data transformation, which means key information can be extracted from unstructured text and used downstream for data integration and data management tasks.
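To make the idea of turning unstructured text into structured facts concrete, here is a minimal sketch of the extract, transform and load steps. It does not use I2E or its API: a simple regular expression stands in for the NLP engine, and the `dose_facts` table, the `DOSE_PATTERN` rule and the sample notes are all assumptions made up for illustration.

```python
import re
import sqlite3

# Hypothetical extraction rule standing in for an NLP engine:
# a word followed by a dose, e.g. "aspirin 75 mg".
DOSE_PATTERN = re.compile(r"\b([A-Za-z]+)\s+(\d+(?:\.\d+)?)\s*(mg|g|mcg)\b")


def extract_facts(doc_id, text):
    """Extract + transform: free text -> (doc_id, drug, dose, unit) rows."""
    return [
        (doc_id, m.group(1).lower(), float(m.group(2)), m.group(3))
        for m in DOSE_PATTERN.finditer(text)
    ]


def load_facts(conn, rows):
    """Load: write the structured rows into a (hypothetical) warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dose_facts "
        "(doc_id TEXT, drug TEXT, dose REAL, unit TEXT)"
    )
    conn.executemany("INSERT INTO dose_facts VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def run_etl(documents):
    """Run the three steps over a dict of {doc_id: text}."""
    conn = sqlite3.connect(":memory:")  # in-memory database for the demo
    for doc_id, text in documents.items():
        load_facts(conn, extract_facts(doc_id, text))
    return conn


documents = {
    "note-1": "Patient started on aspirin 75 mg daily.",
    "note-2": "Switched to metformin 500 mg; previously ibuprofen 0.4 g.",
}
conn = run_etl(documents)
rows = conn.execute(
    "SELECT doc_id, drug, dose, unit FROM dose_facts ORDER BY doc_id, drug"
).fetchall()
for row in rows:
    print(row)
```

Once the facts are in a table, they can be joined, filtered and aggregated like any other structured data, which is the point of the transformation step.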

For technical details of I2E automation, please download our datasheet.

What are the Benefits of using NLP from Linguamatics for Data Transformation?

Using Linguamatics I2E, enterprises can create automated ETL processes to:

  • Easily generate insights from unstructured data to provide tabular or visual analytics to the end-user, or create structured data sets to support research data warehouses, analytical warehouses, machine learning models, and sophisticated search interfaces to support patient care.
  • Enhance existing investments in warehouses, analytics, and dashboards.
  • Provide comprehensive, precise and accurate data to end-users, thanks to I2E’s unique strengths, including capturing precise relationships, finding concepts in appropriate context, normalising and extracting quantitative data, and processing data in embedded tables.
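As an illustration of what quantitative data normalisation involves, the sketch below converts dose mentions written in different units to a single unit (milligrams). It is not I2E's method: the `QUANTITY` pattern, the `TO_MG` conversion table and the sample sentence are assumptions chosen for the example.

```python
import re
from decimal import Decimal

# Hypothetical conversion table: factor that turns each unit into milligrams.
TO_MG = {"mcg": Decimal("0.001"), "mg": Decimal("1"), "g": Decimal("1000")}

# A number followed by one of the supported units.
QUANTITY = re.compile(r"(\d+(?:\.\d+)?)\s*(mcg|mg|g)\b", re.IGNORECASE)


def normalise_doses(text):
    """Return every dose mentioned in `text`, converted to milligrams."""
    return [
        Decimal(value) * TO_MG[unit.lower()]
        for value, unit in QUANTITY.findall(text)
    ]


doses = normalise_doses("Reduced from 0.5 g to 250 mg, plus 500 mcg folic acid.")
print([str(d) for d in doses])
```

`Decimal` is used instead of `float` so that unit conversion does not introduce rounding noise into values that will later be compared or aggregated.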

Customer Use Cases Range from Drug Discovery to Improving Patient Care:

  • Regulatory compliance: extract Identification of Medicinal Products (IDMP) data elements from multilingual Summary of Product Characteristics (SmPC) documents to feed a Regulatory Information Management System (RIMS)
  • Target selection: extract data from patents to generate target-indication database & dashboard
  • Business intelligence: generate email alerts for clinical development and competitive intelligence teams by integrating and structuring data feeds from many sources
  • Patient risk: Extract information from clinical and call center notes to enable Population Stratification for payers and health plans
  • Improve clinical documentation: Provide a service to identify disease terms that are in clinical notes but missing from structured documentation
  • Streamline care: Extract pathology insights in real time to support improved treatment selection, reduced chart review and computational assessment of clinical trials and treatment selection

The Technical Details: What Scalable Text Mining for ETL Provides

Linguamatics I2E AMP

Scalable indexing 

  • Parallel indexing processes exploit multiple cores
  • Distributed indexing across machines

Scalable querying 

  • Distribution across cores
  • Distribution across machines
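The general pattern behind indexing across cores can be sketched as follows: split the documents into shards, build a partial index for each shard in a separate process, then merge the partial results. This is a toy inverted index, not I2E's indexing format, and the function names (`index_shard`, `merge_indexes`, `build_index`, `query`) and sample documents are assumptions for illustration.

```python
from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor


def index_shard(shard):
    """Build a partial inverted index (term -> set of doc ids) for one shard."""
    partial = defaultdict(set)
    for doc_id, text in shard:
        for term in text.lower().split():
            partial[term].add(doc_id)
    return dict(partial)


def merge_indexes(partials):
    """Combine the partial indexes produced by the workers."""
    merged = defaultdict(set)
    for partial in partials:
        for term, doc_ids in partial.items():
            merged[term] |= doc_ids
    return dict(merged)


def build_index(docs, workers=2, executor_cls=ProcessPoolExecutor):
    """Shard the documents, index each shard in its own worker, then merge."""
    items = sorted(docs.items())
    shards = [items[i::workers] for i in range(workers)]
    with executor_cls(max_workers=workers) as pool:
        partials = list(pool.map(index_shard, shards))
    return merge_indexes(partials)


def query(index, *terms):
    """Return the ids of documents that contain every term."""
    hits = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*hits) if hits else set()


DOCS = {
    "d1": "EGFR mutation detected in tumour sample",
    "d2": "No mutation found in KRAS",
    "d3": "Patient has EGFR exon 19 mutation",
}

if __name__ == "__main__":
    # The guard is needed because worker processes may re-import this module.
    index = build_index(DOCS, workers=2)
    print(sorted(query(index, "egfr", "mutation")))
```

The same shard-and-merge shape extends to multiple machines by replacing the local process pool with remote workers. The `executor_cls` parameter is there so that a different executor can be swapped in without changing the indexing logic.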

Federated architecture 

  • Support for load balancing
  • Scalable document processing pipelines

Distributed processes across machines 

  • The I2E AMP asynchronous messaging platform provides fault-tolerant and scalable processing
  • Hadoop-compatible
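To show in general terms how asynchronous messaging can give buffering and fault tolerance, here is a small sketch using Python's `asyncio`. It is not how I2E AMP is implemented: the bounded queue, the retry limit `MAX_ATTEMPTS`, and the deliberately flaky `process` function are all assumptions invented for the example.

```python
import asyncio

MAX_ATTEMPTS = 3  # hypothetical retry limit per document


async def worker(queue, process, results, failed):
    """Take documents off the queue; retry each one before giving up."""
    while True:
        doc_id, text = await queue.get()
        try:
            for attempt in range(1, MAX_ATTEMPTS + 1):
                try:
                    results[doc_id] = await process(doc_id, text)
                    break
                except Exception:
                    if attempt == MAX_ATTEMPTS:
                        failed.append(doc_id)  # record it and keep going
        finally:
            queue.task_done()


async def run_pipeline(documents, process, workers=2, buffer_size=4):
    """Feed documents through a bounded buffer to a pool of workers."""
    queue = asyncio.Queue(maxsize=buffer_size)
    results, failed = {}, []
    tasks = [
        asyncio.create_task(worker(queue, process, results, failed))
        for _ in range(workers)
    ]
    for item in documents:
        await queue.put(item)  # waits here whenever the buffer is full
    await queue.join()  # wait until every queued document is handled
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return results, failed


def make_flaky_processor():
    """Demo processor: doc-2 fails once, doc-3 always fails."""
    calls = {}

    async def process(doc_id, text):
        calls[doc_id] = calls.get(doc_id, 0) + 1
        await asyncio.sleep(0)  # stand-in for real I/O
        if doc_id == "doc-2" and calls[doc_id] == 1:
            raise RuntimeError("transient failure")
        if doc_id == "doc-3":
            raise ValueError("unparseable document")
        return len(text.split())  # toy result: word count

    return process


documents = [
    ("doc-1", "first clinical note"),
    ("doc-2", "second note"),
    ("doc-3", ""),
]
results, failed = asyncio.run(run_pipeline(documents, make_flaky_processor()))
print(results, failed)
```

The bounded queue slows the producer down when workers fall behind, and a document that keeps failing is recorded rather than stopping the rest of the batch.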

Contact us to learn more