By integrating chemical search from ChemAxon and text mining from Linguamatics, I2E users can identify chemical structures and understand their role within documents such as patents, scientific articles, and internal documents. Applications include scientific research, intellectual property search, and commercial intelligence.
Scalable to millions of large documents, such as full-text patents, I2E Chemistry allows users to extract all known and novel chemical structures within documents. Alternatively, subsets can be specified by filtering on substructure or structural similarity. A range of chemical search capabilities helps find both known and novel compounds, including:
I2E’s agile NLP-based text mining approach allows extraction of chemicals, their associated properties, and any relationships with other entities, to discover a wide range of chemical information, such as:
ChemAxon’s JChem library has been integrated with I2E document processing to allow automatic recognition and extraction of chemical entities embedded in documents. The I2E Chemistry module provides state-of-the-art chemical entity recognition by combining a number of strategies:
I2E Chemistry can process documents in a wide range of formats, such as PDF, XML, DOCX, and TSV, making internal document silos available for chemical text mining.
Taking an automated approach to identifying chemical structures avoids costly and time-consuming manual chemical mark-up. Searches can also be run across both internal and external documents, such as safety reports, regulatory documents, clinical information, or scientific literature and patents.
Text mining can answer specific questions much more efficiently than document search. For example, asking for all chemicals used for capsule coatings in combination with gelatine returns a list of chemicals with associated evidence, rather than a list of documents to read.
Automated recognition allows structures to be found in documents where mark-up by hand has not yet been done, or is uneconomic to do, e.g. for a company’s internal reports. It is highly scalable and can find chemical structures at particular points within a document, so you can ask questions such as “which chemicals are mentioned in the same list as chemical X?”
Chemical substructure searches can be saved and re-run on any set of documents. Because the chemical queries are portable, they can also be shared with collaborators.
Roche opted to use the Linguamatics I2E NLP solution to extract the relevant compound, target, and disease information from a broad range of published and internal sources into a structured, analytics-friendly format. I2E is well suited to the variability found in drug-related text and data: it uses NLP, taxonomies, thesauri, and ontologies to detect and extract drug names, targets, diseases, and their relationships, no matter how they are expressed.
Roche augmented the I2E text mining solution with ChemAxon’s chemical annotation and name-to-structure tools to extract the maximum amount of chemical structural information from the text sources. Read the full article to find out more about how medicinal chemists can overcome the text big data deluge, or download the full case study on Text Mining at Roche pRED.