Ebola: text analytics over patent sources for medicinal chemistry

November 13 2014

The 2014 Ebola outbreak is officially the deadliest in history. Governments and organizations are searching for ways to halt the spread – both responding with humanitarian help, and looking for treatments to prevent or cure the viral infection. 

Ebola virus disease (or Ebola haemorrhagic fever) is caused by the Ebola filovirus.

A couple of weeks ago we received a tweet from Chris Southan, who has been looking at crowdsourcing anti-Ebola medicinal chemistry. He asked us to mine Ebola C07D patents (i.e. those for heterocyclic small molecules, the standard chemistry for most drugs) using our text analytics tool I2E, and provide him with the resulting chemical structures.

We wanted to help. What anti-Ebola research has been patented that might provide value to the scientific community? Searching patents for chemistry using an automated approach is notoriously tricky: patent documents are long and often purposefully obfuscated. Chemical names are frequently obscured by the complex language used to describe them, corrupted by OCR errors, or lost in the poor overall formatting of the documents.

Andrew Hinton, one of our Application Specialists with a background in chemistry, used I2E to map the patent landscape around Ebola, identify patents for small molecules described as targeting Ebola, and extract the chemical structures. He compiled queries to answer the key questions and find the most relevant patents:

  • Does the patent mention Ebola or Ebola-like diseases? More importantly, is Ebola the major focus of the patent?
  • Who is the pharma or biotech company?
  • Is it a small molecule or non-small molecule patent?
  • What’s the exemplified chemistry? What’s the claimed chemistry? What’s the Markush chemistry?
  • What chemistry is found as an image? What chemistry is found in a table? Can we extract these structures too?

Andrew ran these queries over USPTO, EPO and WIPO patents from the past 20 years, using data derived from IFI CLAIMS.
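To make the first few questions concrete, here is a minimal Python sketch of the kind of triage they imply. It is not how I2E works: I2E uses linguistic text analytics, whereas this is plain keyword and classification-code matching. The `PatentRecord` layout, the sample records, the term list, and the "Ebola term in the title means major focus" rule are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


# Hypothetical record layout; real patent feeds carry many more fields.
@dataclass
class PatentRecord:
    publication_number: str
    ipc_codes: List[str]
    title: str
    abstract: str


# Assumed term list: matches "Ebola", "Ebolavirus", "filovirus", "filoviruses", ...
EBOLA_TERMS = re.compile(r"\b(?:ebola\w*|filovir\w*)\b", re.IGNORECASE)


def mentions_ebola(record: PatentRecord) -> bool:
    """Does the title or abstract mention Ebola or a filovirus at all?"""
    return bool(EBOLA_TERMS.search(record.title + " " + record.abstract))


def ebola_is_major_focus(record: PatentRecord) -> bool:
    """Crude proxy (an assumption): an Ebola term in the title signals focus."""
    return bool(EBOLA_TERMS.search(record.title))


def is_c07d(record: PatentRecord) -> bool:
    """Is any classification code in subclass C07D (heterocyclic compounds)?"""
    return any(
        code.replace(" ", "").upper().startswith("C07D")
        for code in record.ipc_codes
    )


def triage(records: List[PatentRecord]) -> Dict[str, List[str]]:
    """Sort Ebola-mentioning patents into three buckets by publication number."""
    buckets: Dict[str, List[str]] = {
        "c07d_focused": [],
        "c07d_mention": [],
        "non_c07d": [],
    }
    for record in records:
        if not mentions_ebola(record):
            continue  # not relevant to the Ebola landscape
        if not is_c07d(record):
            buckets["non_c07d"].append(record.publication_number)
        elif ebola_is_major_focus(record):
            buckets["c07d_focused"].append(record.publication_number)
        else:
            buckets["c07d_mention"].append(record.publication_number)
    return buckets


# Made-up records for illustration only; these are not real patents.
SAMPLE_RECORDS = [
    PatentRecord("EXAMPLE-001", ["C07D 401/12", "A61K 31/00"],
                 "Heterocyclic inhibitors of Ebola virus entry",
                 "Compounds and uses thereof."),
    PatentRecord("EXAMPLE-002", ["C07D 213/00"],
                 "Broad-spectrum antiviral pyridines",
                 "Active against influenza and filoviruses."),
    PatentRecord("EXAMPLE-003", ["C07K 16/10"],
                 "Antibodies binding Ebolavirus glycoprotein",
                 "Monoclonal antibodies."),
    PatentRecord("EXAMPLE-004", ["C07D 487/04"],
                 "Kinase inhibitors",
                 "Treatment of cancer."),
]

result = triage(SAMPLE_RECORDS)
print(result)
```

A title-only test for "major focus" is deliberately simplistic; in practice the claims and description would also need to be examined, which is where the difficulties described earlier (obfuscated language, OCR errors, chemistry in images and tables) come in.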

Graph showing Ebola-related C07D patents (blue, left-hand axis) and non-C07D patents (red line, right-hand axis) from 1994 to 2014, from the three major patent registries [please note the different scales for the axes].

The results showed a general increase in the number of patent records related to Ebola, but the numbers are comparatively small. For example, about 50k C07D patents were published in 2010 across all therapeutic areas; of these, we found only about 100 that related to Ebola (and the number of truly unique patent families is likely to be smaller still). This isn’t really surprising: as with most viral diseases, the main emphasis for therapies has been on biologics and other non-small-molecule treatments. In fact, of the 16k total patents that mention Ebola, only 1% are C07D patents focused specifically on Ebola.
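For a sense of scale, the rounded figures quoted above can be turned into rough proportions. The short Python sketch below uses only those approximate numbers (the variable names are ours), so the results are order-of-magnitude estimates rather than exact counts.

```python
# Approximate figures quoted in the text (all rounded, "about" values).
c07d_patents_2010 = 50_000        # C07D patents published in 2010, all areas
ebola_c07d_patents_2010 = 100     # of those, patents relating to Ebola
ebola_mentions_total = 16_000     # all patents that mention Ebola
focused_c07d_percent = 1          # percent that are Ebola-focused C07D patents

# Share of 2010 C07D patents that relate to Ebola.
ebola_share_2010 = ebola_c07d_patents_2010 / c07d_patents_2010

# Rough count of Ebola-focused C07D patents implied by the 1% figure.
focused_c07d_count = ebola_mentions_total * focused_c07d_percent // 100

print(f"Ebola share of 2010 C07D patents: {ebola_share_2010:.1%}")
print(f"Ebola-focused C07D patents (approx.): {focused_c07d_count}")
```

On these figures, Ebola-related patents make up roughly 0.2% of the 2010 C07D output, and the 1% figure corresponds to something in the region of 160 patents.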

Heatmap showing that the top 3 organizations with small molecule C07D patents in this area contribute 1/5 of all Ebola patents.

So what is the outcome of this? Using I2E, we have been able to extract the set of molecules reported in these Ebola-related patents, and will provide a set of these to Chris Southan for his chemoinformatics analysis. Let’s hope that this little step might further research towards providing a solution to the current Ebola epidemic. 

Screenshot of I2E results showing names and structures extracted using text analytics from Ebola-related patent set. Structures are generated using ChemAxon’s Name-to-Structure conversion tools.