By integrating chemical search from ChemAxon and text mining from Linguamatics, I2E users can identify chemical structures and understand their role within documents such as patents, scientific articles, and internal documents. Applications include scientific research, intellectual property search, and commercial intelligence.
Key Benefits
- Faster and more efficient identification of chemicals and associated information
- Fully automated extraction of chemical information
- No costly manual chemical mark-up
- Chemical searching across any document (internal and external)
Scalable to millions of large documents, e.g. full-text patents, I2E Chemistry allows users to extract all known and novel chemical structures within documents. Alternatively, subsets can be specified by filtering on substructure or structural similarity. A range of chemical search capabilities helps find both known and novel compounds, including:
- Exact, substructure and similarity search
- Dictionary-based search for known chemicals, e.g. by common name
- Name-to-structure conversion to find novel chemicals
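As a rough illustration of similarity filtering, the sketch below ranks compounds by the Tanimoto coefficient, a measure widely used for comparing structural fingerprints. It is not taken from I2E or JChem: the source does not state which similarity measure they use, and the compound names, feature labels, and the `similar` helper are all hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient of two feature sets: |A & B| / |A | B|."""
    if not a and not b:
        return 1.0  # two empty fingerprints are treated as identical
    return len(a & b) / len(a | b)


def similar(query, library, threshold=0.5):
    """Return (name, score) pairs scoring at least `threshold`, best first."""
    scored = ((name, tanimoto(query, fp)) for name, fp in library.items())
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: -pair[1])


# Hypothetical fingerprints: each compound is reduced to a set of feature labels.
LIBRARY = {
    "compound_a": frozenset({"aromatic_ring", "carboxylic_acid", "ester"}),
    "compound_b": frozenset({"aromatic_ring", "carboxylic_acid", "hydroxyl"}),
    "compound_c": frozenset({"amine", "alkyl_chain"}),
}
QUERY = frozenset({"aromatic_ring", "carboxylic_acid"})

for name, score in similar(QUERY, LIBRARY):
    print(f"{name}: {score:.2f}")
```

Real fingerprints are bit vectors derived from the structure itself, but the scoring and thresholding step has the same shape.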
I2E’s agile NLP-based text mining approach allows extraction of chemicals, associated properties and any relationship with other entities to discover a wide range of chemical information, such as:
- Structure activity relationships (SAR)
- Melting point or boiling point of a chemical
- The role of a chemical in a reaction
- Concentration of chemicals
An example question: “Chemicals that contain this substructure act as inhibitors against which targets?”
Integration with ChemAxon Name to Structure
ChemAxon’s JChem library has been integrated with I2E document processing to allow automatic recognition and extraction of chemical entities embedded in documents. The I2E Chemistry module provides state-of-the-art chemical entity recognition by combining a number of strategies:
- Dictionary matching of layman or archaic names, e.g. saltpeter, slaked lime, borax; traditional and IUPAC names, e.g. sodium nitrate, calcium hydroxide, sodium tetraborate decahydrate; and official, marketing, or slang names of drugs.
- Formulae recognition, e.g. NaNO3, Ca(OH)2, Na2B4O7•10H2O.
- Recognition of systematic names as defined by IUPAC.
- Novel compound recognition using ChemAxon Name-to-Structure.
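The first two strategies in the list above can be sketched in a few lines of Python. This is a toy illustration only, not how I2E or JChem implement recognition: the name dictionary, the deliberately tiny element table, the regular expression, and the function names are all assumptions made for the example.

```python
import re

# Hypothetical mini-dictionary built from the example names above.
KNOWN_NAMES = {
    "saltpeter": "common",
    "slaked lime": "common",
    "borax": "common",
    "sodium nitrate": "systematic",
    "calcium hydroxide": "systematic",
    "sodium tetraborate decahydrate": "systematic",
}

# Deliberately tiny element table; a real one would cover the periodic table.
ELEMENTS = {"H", "B", "C", "N", "O", "Na", "Ca"}

# One formula unit: element symbols with optional counts, or a parenthesised group.
_UNIT = r"(?:[A-Z][a-z]?\d*|\((?:[A-Z][a-z]?\d*)+\)\d*)+"
# A unit, optionally followed by a hydrate part such as "•10H2O".
FORMULA_RE = re.compile(
    rf"(?<![A-Za-z0-9]){_UNIT}(?:[•·]\d*{_UNIT})?(?![A-Za-z0-9])"
)
SYMBOL_RE = re.compile(r"[A-Z][a-z]?")


def looks_like_formula(token):
    """Accept only known element symbols, and require a digit or 2+ symbols."""
    symbols = SYMBOL_RE.findall(token)
    has_digit = any(ch.isdigit() for ch in token)
    return all(s in ELEMENTS for s in symbols) and (has_digit or len(symbols) > 1)


def find_chemicals(text):
    """Return (offset, matched text, kind) tuples sorted by position."""
    hits = []
    for name, kind in KNOWN_NAMES.items():
        pattern = rf"\b{re.escape(name)}\b"
        for m in re.finditer(pattern, text, re.IGNORECASE):
            hits.append((m.start(), m.group(), f"{kind} name"))
    for m in FORMULA_RE.finditer(text):
        if looks_like_formula(m.group()):
            hits.append((m.start(), m.group(), "formula"))
    return sorted(hits)


SAMPLE = "Borax (Na2B4O7•10H2O) and slaked lime, Ca(OH)2, were mixed with NaNO3. He noted it."
for offset, match, kind in find_chemicals(SAMPLE):
    print(offset, match, kind)
```

The element check is what keeps ordinary capitalised words such as "He" out of the results. Systematic-name parsing and name-to-structure conversion need far more than this and are left out here.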
Identify and Extract Chemicals from Internal and External Documents
I2E Chemistry can process documents in a wide range of formats, such as PDF, XML, DOCX, and TSV, allowing internal document silos to become available for chemical text mining.
Taking an automated approach to identifying chemical structures avoids costly and time-consuming manual chemical mark-up. Searches can also be run across both internal and external documents, such as safety reports, regulatory documents, clinical information, or scientific literature and patents.
Why is it Different from Searching using Chemical Structures?
Text mining can answer specific questions much more efficiently than document search. For example, a query for all chemicals used for capsule coatings in combination with gelatine returns a list of chemicals with associated evidence, rather than a list of documents to read.
What are the Advantages of Fully Automatic Methods for Finding Chemical Structures?
Fully automatic methods allow structures to be found in documents where mark-up by hand has not yet been done or is uneconomic, e.g. a company’s internal reports. The approach is highly scalable and can locate chemical structures at particular points within a document, so you can ask questions such as “which chemicals are mentioned in the same list as chemical X”.
Save and Share Searches for Automation and Reproducibility
Chemical substructure searches can be saved and re-run on any set of documents. Because the chemical queries are portable, they can also be shared with collaborators.
Use Cases
Roche opted to use the Linguamatics I2E NLP solution to extract the relevant compound/target/disease information from a broad range of published and internal sources into a structured and analytics-friendly format. I2E is well suited to the variability found in drug-related text and data: it uses NLP, taxonomies, thesauri, and ontologies to detect and extract drug names, targets, diseases, and their relationships, no matter how they are expressed.
Roche augmented the I2E text mining solution with ChemAxon’s chemical annotation and name-to-structure tools to extract the maximum amount of chemical structural information from the text sources. Read the full article to find out more about how medicinal chemists can overcome the text big data deluge, or download the full case study on Text Mining at Roche pRED.