I2E Chemistry provides the first interactive text mining system designed for chemistry.
By integrating chemical search from ChemAxon and text mining from Linguamatics, users can identify chemical structures and understand their role within documents such as patents, scientific articles, and internal documents. Applications include scientific research, intellectual property search, and commercial intelligence.
I2E Chemistry allows users to extract all chemical structures within documents or to filter based on substructure or similarity. It allows extraction of chemicals, associated properties, and any relationships with other entities. A range of chemical search capabilities helps find both known and novel compounds, using:
- Exact, substructure and similarity search
- Dictionary based search for known chemicals e.g. common names
- Name to structure to find novel chemicals
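As an illustration of what similarity search means in practice, the sketch below ranks compounds by the Tanimoto coefficient, a measure commonly used in cheminformatics to compare structural fingerprints. The source does not state which measure I2E or JChem uses, so this is an assumption. The fingerprints are hypothetical sets of feature identifiers, and the names `tanimoto` and `rank_by_similarity` are invented for this example.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient: shared features divided by all features present."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


def rank_by_similarity(query_fp: set, library: dict, threshold: float = 0.5) -> list:
    """Return (name, score) pairs at or above the threshold, best match first."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    hits = [pair for pair in scored if pair[1] >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical fingerprints: each integer stands for one structural feature.
library = {
    "compound_a": {1, 2, 3, 4},
    "compound_b": {2, 3, 4},
    "compound_c": {7, 8, 9},
}
print(rank_by_similarity({1, 2, 3}, library))
```

An exact search corresponds to requiring identical structures, while a substructure search asks whether the query is contained in the candidate; neither is covered by this sketch.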
I2E’s agile NLP-based text mining approach allows users to discover a wide range of chemical information, such as:
- Structure activity relationships (SAR)
- Melting point or boiling point of a chemical
- The role of a chemical in a reaction
- Concentration of chemicals
- Which chemicals with a given substructure act as inhibitors
Faster and more efficient identification of chemicals and associated information
- Fully automated extraction of chemical information
- Avoid costly manual chemical mark-up
- Chemical searching across any document (internal and external)
- Scalable to millions of large documents e.g. full text patents
Chemistry-enabled Text Mining
ChemAxon’s JChem library has been integrated with I2E to allow automatic recognition and extraction of chemical entities embedded in documents. The I2E Chemistry module provides state-of-the-art chemical entity recognition by combining a number of strategies:
- Dictionary matching of layman or archaic names, e.g. saltpeter, slaked lime, borax; traditional and IUPAC names, e.g. sodium nitrate, calcium hydroxide, sodium tetraborate decahydrate; and official, marketing, or slang names of drugs.
- Formulae recognition, e.g. NaNO3, Ca(OH)2, Na2B4O7•10H2O.
- Recognition of systematic names as defined by IUPAC.
- Novel compound recognition using ChemAxon Name-to-Structure.
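The first two strategies above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the I2E or JChem implementation: the dictionary `COMMON_NAMES`, the reduced element list `ELEMENTS`, and the function `find_chemical_mentions` are all hypothetical, and a production system would use far larger lexicons and the full periodic table.

```python
import re

# Hypothetical mini-dictionary: common name -> systematic name.
COMMON_NAMES = {
    "slaked lime": "calcium hydroxide",
    "borax": "sodium tetraborate decahydrate",
}

# Deliberately small element list (assumption); two-letter symbols come first.
ELEMENTS = ["Na", "Ca", "Cl", "H", "B", "C", "N", "O", "K", "S"]
_EL = "(?:" + "|".join(ELEMENTS) + ")"
# One unit is an element with an optional count, or a bracketed group with a count.
_UNIT = rf"(?:{_EL}\d*|\((?:{_EL}\d*)+\)\d*)"
# Two or more units, optionally followed by a hydrate part such as "•10H2O".
FORMULA_RE = re.compile(
    rf"(?<![A-Za-z0-9]){_UNIT}{{2,}}(?:[•·]\d*{_UNIT}+)?(?![A-Za-z])"
)


def find_chemical_mentions(text: str) -> list:
    """Return (offset, matched text, strategy, systematic name or None)."""
    mentions = []
    # Strategy 1: case-insensitive dictionary matching of known names.
    for name, systematic in COMMON_NAMES.items():
        pattern = rf"\b{re.escape(name)}\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            mentions.append((m.start(), m.group(0), "dictionary", systematic))
    # Strategy 2: pattern-based recognition of molecular formulae.
    for m in FORMULA_RE.finditer(text):
        mentions.append((m.start(), m.group(0), "formula", None))
    return sorted(mentions, key=lambda item: item[0])


text = "Borax (Na2B4O7•10H2O) and slaked lime, Ca(OH)2, were added to NaNO3."
for mention in find_chemical_mentions(text):
    print(mention)
```

A pattern this simple will also flag some capitalised abbreviations that happen to spell element symbols, which is one reason to combine several strategies rather than rely on any single one.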
Identify and Extract Chemicals from Internal and External Documents
I2E Chemistry can process documents in a wide range of formats, such as PDF, XML, DOCX, and TSV, allowing internal document silos to become available for chemical text mining.
Taking an automated approach to identifying chemical structures avoids costly and time-consuming manual chemical mark-up. Searches can also be run across both internal and external documents, such as safety reports, regulatory documents, clinical information, or scientific literature and patents.
Why is it Different from Searching using Chemical Structures?
Text mining can answer specific questions much more efficiently than document search. For example, the query “show me all chemicals used for capsule coatings combined with gelatine” returns a list of chemicals with associated evidence, rather than a list of documents to read.
What are the Advantages of Fully Automatic Methods for Finding Chemical Structures?
Fully automatic methods allow structures to be found in documents where mark-up by hand has not yet been done, or is uneconomic to do, e.g. for a company’s internal reports. The approach is highly scalable and can find chemical structures at particular points within a document, so you can ask questions such as “which chemicals are mentioned in the same list as chemical X”.
Save and Share Searches for Automation and Reproducibility
Chemical substructure searches can be saved and re-run on any set of documents. The portable nature of the chemical queries allows them to be shared with people you are collaborating with.
Chemistry-enabled Text Mining Use Case
Roche opted to use the Linguamatics I2E NLP solution to extract the relevant compound/target/disease information from a broad range of published and internal sources into a structured and analytics-friendly format. I2E is ideally suited to deal with the variability found in drug-related text and data and uses NLP, taxonomies, thesauri, and ontologies to detect and extract drug names, targets, diseases, and their relationships no matter how they are expressed.
Roche augmented the I2E text mining solution with ChemAxon’s chemical annotation and name-to-structure tools to extract the maximum amount of chemical structural information from the text sources. Read the full article to find out more about how medicinal chemists can overcome the text big data deluge, or download the full case study on Text Mining at Roche pRED.