Multilingual Text Mining in I2E 4.4

November 20, 2015

Scientific papers are mainly written in English, so it is not surprising that most scientific text mining has concentrated on just one language. However, as the use of text mining has become broader, moving from early research through to clinical and post-marketing work, there is an increasing need to deal with other languages. In the pharmaceutical sector, this is seen in projects ranging across voice of the customer, analysis of sales reports, adverse event monitoring, patent analysis, and checking the quality of regulatory submission documents. In healthcare, hospitals often have a multinational presence and need to collect information from records written in several languages.

Multilingual processing not only allows text mining in other languages (for example, a French medic analysing French electronic medical records), but also makes it easier to mine foreign-language documents and to mine across different languages. A couple of examples:

  • An English-speaking researcher can mine Chinese text using concepts found through their English synonyms, extract the relationships of interest, and then use a tool such as Google Translate to show the evidence within the original text.
  • A French medic can automatically link their medical records with relevant clinical trials written in English.

Linguamatics recognized this growing need and, in I2E 4.4, provides a platform that can deal with multiple languages. It can even deal with cases such as patent documents, where a single document contains text written in multiple languages, so that an English adverse event synonym such as “die” does not match the German determiner “die”.
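To illustrate why language awareness matters for the “die” example, here is a minimal Python sketch. It is not I2E's API or implementation: the `SYNONYMS` table, the `find_concepts` function, and the pre-labelled `(language, text)` segments are all assumptions made for illustration. The idea is simply that a synonym is only looked up in text of the language it belongs to.

```python
import re

# Hypothetical synonym table (illustration only, not I2E data):
# each surface form is valid only within its own language.
SYNONYMS = {
    "en": {"die": "death", "died": "death"},
    "de": {"sterben": "death"},
}

def find_concepts(segments):
    """Match synonyms per language.

    segments: list of (language_code, text) pairs, assumed to have been
    labelled already by some language-identification step.
    Returns a list of (language_code, matched_word, concept) tuples.
    """
    hits = []
    for lang, text in segments:
        terms = SYNONYMS.get(lang, {})
        for word in re.findall(r"\w+", text.lower()):
            if word in terms:
                hits.append((lang, word, terms[word]))
    return hits

# A mixed-language document, as might occur in a patent.
patent = [
    ("en", "Two animals die after the highest dose."),
    ("de", "Die Verbindung wird in Wasser gelöst."),
]
hits = find_concepts(patent)
print(hits)
```

Only the English segment produces a hit; the German article “Die” is ignored because the lookup for German segments uses the German synonym list.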

I2E 4.4 extends the range of languages supported out of the box, and also makes the platform more open so that advanced linguistic processing can be easily integrated for new languages when required:

  • Tokenization (splitting sentences into words) can now deal with Japanese and Chinese text, where there are typically no spaces between words
  • Stemming and morphological variant handling covers 15 languages
  • French and German grammatical processing is provided in addition to English, with an example plug-in package for Spanish based on an open source NLP toolkit
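As a rough illustration of the tokenization problem for text without spaces, the following Python sketch uses forward maximum matching, a classic dictionary-based technique. This is an assumption chosen for clarity, not a description of how I2E tokenizes; the `LEXICON` and the `max_match` function are hypothetical, and production systems typically rely on much larger dictionaries or statistical models.

```python
# Hypothetical toy lexicon (illustration only).
LEXICON = {"我", "喜欢", "文本", "挖掘", "文本挖掘"}

def max_match(text, lexicon):
    """Forward maximum matching: at each position, take the longest
    lexicon entry that starts there; fall back to a single character
    when nothing in the lexicon matches."""
    max_len = max(map(len, lexicon))
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens

tokens = max_match("我喜欢文本挖掘", LEXICON)
print(tokens)
```

For the sentence “我喜欢文本挖掘” (“I like text mining”), the sketch prefers the longer entry “文本挖掘” over the separate words “文本” and “挖掘”, which shows how the choice of dictionary shapes the resulting tokens.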

If you are interested in trying out this functionality, please contact us at enquires@linguamatics.com to discuss your requirements.