Chemistry-enabled text mining

Chemical Text Mining

I2E Chemistry provides the first interactive text mining system designed for chemistry.

By integrating chemical search from ChemAxon and text mining from Linguamatics, users can identify chemical structures and understand their role within documents such as patents, scientific articles, and internal documents. Applications include scientific research, intellectual property search, and commercial intelligence.

I2E Chemistry allows users to extract all chemical structures within documents or to filter based on substructure or similarity. It allows extraction of chemicals, their associated properties, and any relationships with other entities. A range of chemical search capabilities helps find both known and novel compounds, using:

  • Exact, substructure and similarity search
  • Dictionary-based search for known chemicals, e.g. common names
  • Name-to-structure conversion to find novel chemicals
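
As a rough illustration of what similarity search means, the sketch below ranks compounds by the Tanimoto coefficient, a measure commonly used for chemical similarity. It is not how I2E or JChem implements the search: the source does not say which measure they use, the feature sets are hand-written stand-ins for real structural fingerprints, and the names `tanimoto`, `similar_to`, and `FINGERPRINTS` and the default threshold are assumptions made for the example.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient: shared features / total distinct features."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical feature sets standing in for real structural fingerprints.
FINGERPRINTS = {
    "aspirin": frozenset({"benzene_ring", "carboxylic_acid", "ester"}),
    "salicylic acid": frozenset({"benzene_ring", "carboxylic_acid", "hydroxyl"}),
    "ethanol": frozenset({"hydroxyl"}),
}


def similar_to(query: str, threshold: float = 0.4) -> list[tuple[str, float]]:
    """Return (name, score) pairs scoring at least `threshold`, best first."""
    query_fp = FINGERPRINTS[query]
    hits = [
        (name, tanimoto(query_fp, fp))
        for name, fp in FINGERPRINTS.items()
        if name != query
    ]
    return sorted(
        (hit for hit in hits if hit[1] >= threshold),
        key=lambda hit: hit[1],
        reverse=True,
    )
```

Calling `similar_to("aspirin")` keeps salicylic acid, which shares two of four distinct features, and drops ethanol, which shares none. Exact search corresponds to requiring identical structures, and substructure search to requiring that the query's features all appear in the target.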

I2E’s agile NLP-based text mining approach allows users to discover a wide range of chemical information, such as: 

  • Structure activity relationships (SAR)
  • Melting point or boiling point of a chemical
  • The role of a chemical in a reaction
  • Concentration of chemicals 

Example question: "Which chemicals with this substructure act as inhibitors?"

Key Benefits

  • Faster and more efficient identification of chemicals and associated information
  • Fully automated extraction of chemical information
  • Avoid costly manual chemical mark-up
  • Chemical searching across any document (internal and external)
  • Scalable to millions of large documents e.g. full text patents

Chemistry-enabled text mining

ChemAxon’s JChem library has been integrated with I2E to allow automatic recognition and extraction of chemical entities embedded in documents. The I2E Chemistry module provides state-of-the-art chemical entity recognition by combining a number of strategies:

  • Dictionary matching of: layman or archaic names, e.g. saltpeter, slaked lime, borax; traditional and IUPAC names, e.g. sodium nitrate, calcium hydroxide, sodium tetraborate decahydrate; and official, marketing, or slang names for drugs.
  • Formula recognition, e.g. NaNO3, Ca(OH)2, Na2B4O7·10H2O.
  • Recognition of systematic names as defined by IUPAC.
  • Novel compound recognition using ChemAxon Name-to-Structure. 
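
To make the first two strategies concrete, here is a minimal sketch that combines dictionary matching with a simple formula recognizer. It is an illustration only and does not reflect the I2E or JChem implementation: the tiny name list and element set, the regular expressions, and the function names `is_formula` and `find_chemicals` are all assumptions made for the example. Systematic-name parsing and name-to-structure conversion are out of scope here.

```python
import re

# Tiny illustrative dictionary; a real system would use a large curated one.
NAME_DICTIONARY = {"slaked lime", "borax", "calcium hydroxide", "sodium nitrate"}

# Small subset of element symbols, enough for the examples in the text.
ELEMENTS = {"H", "B", "C", "N", "O", "Na", "K", "Ca", "Cl", "S"}

# A candidate token starts with a capital letter and may contain
# letters, digits, and parentheses.
CANDIDATE = re.compile(r"(?<![A-Za-z0-9])[A-Z][A-Za-z0-9()]*[A-Za-z0-9)]")
SYMBOL = re.compile(r"[A-Z][a-z]?")


def is_formula(token: str) -> bool:
    """True if the token consists only of known element symbols, counts,
    and balanced parentheses, with at least two element symbols."""
    symbols = SYMBOL.findall(token)
    leftover = re.sub(r"[A-Z][a-z]?|\d|[()]", "", token)
    balanced = token.count("(") == token.count(")")
    return (
        len(symbols) >= 2
        and not leftover
        and balanced
        and all(symbol in ELEMENTS for symbol in symbols)
    )


def find_chemicals(text: str) -> list[tuple[int, str, str]]:
    """Return (start offset, matched text, strategy) tuples in text order."""
    hits = []
    for name in NAME_DICTIONARY:
        pattern = rf"\b{re.escape(name)}\b"
        for match in re.finditer(pattern, text, re.IGNORECASE):
            hits.append((match.start(), match.group(), "dictionary"))
    for match in CANDIDATE.finditer(text):
        if is_formula(match.group()):
            hits.append((match.start(), match.group(), "formula"))
    return sorted(hits)
```

Run on a sentence such as "Slaked lime, Ca(OH)2, was mixed with borax and NaNO3.", the sketch reports two dictionary hits and two formula hits with their character offsets, while ordinary capitalized words and acronyms such as "DNA" are rejected because they do not decompose into known element symbols.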

Identify and extract chemicals from internal and external documents

I2E Chemistry can process documents in a wide range of formats, such as PDF, XML, DOCX, and TSV, allowing internal document silos to become available for chemical text mining.

Taking an automated approach to identifying chemical structures avoids costly and time-consuming manual chemical mark-up. Searches can also be run across both internal and external documents, such as safety reports, regulatory documents, clinical information, scientific literature, and patents.

Why is it different from searching using chemical structures?

Text mining can answer specific questions much more efficiently than document search. For example, the query “show me all chemicals used for capsule coatings combined with gelatine” returns a list of chemicals with associated evidence, rather than a list of documents to read.

What are the advantages of fully automatic methods for finding chemical structures?

Fully automatic methods allow structures to be found in documents where manual mark-up has not yet been done or is uneconomic, e.g. for a company’s internal reports. The approach is highly scalable and can locate chemical structures at particular points within a document, so you can ask questions such as “which chemicals are mentioned in the same list as chemical X?”.

Save and share searches for automation and reproducibility

Chemical substructure searches can be saved and re-run on any set of documents. Because the chemical queries are portable, they can also be shared with collaborators.

For more information and case studies: 

Press release: Linguamatics broadens the utility of its market-leading text mining platform with the launch of a new web services API, virtual data integration and new chemistry capabilities

Ebola: Text analytics over patent sources for medicinal chemistry 

Maximize the Use of I2E Chemistry with new Linkout Capabilities

Text mining - increasing the speed to insight for chemists in the life sciences