The CALBC Silver Standard Corpus - Harmonizing multiple semantic annotations in a large biomedical corpus

Rebholz-Schuhmann D, Jimeno Yepes AJ, Van Mulligen EM, Kang N, Kors J, Milward D, Corbett P, Hahn U.

Proc 3rd Int Symp Languages Biology Medicine. 2009 Nov; pp. 64-72



The CALBC initiative aims to provide a large-scale biomedical text corpus that contains semantic annotations for tagged named entities of different kinds. The generation of this corpus requires that the annotations from different automatic annotation systems are harmonized.

In the first phase, the annotation systems of five participants (EMBL-EBI, EMC Rotterdam, NLM, JULIE Lab Jena, and Linguamatics) were gathered.

All annotations were delivered in a common annotation format that included concept ids in the boundary assignments and that enabled comparison and alignment of the results. During the harmonization phase, the results produced by the different systems were integrated into a single harmonized corpus (the “silver standard” corpus) by applying a voting scheme. We give an overview of the processed data and the principles of harmonization: formal boundary reconciliation and semantic matching of named entities. Finally, all participants' submissions were evaluated against the silver standard corpus.
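As a rough illustration of how a voting scheme over system outputs could work, the following Python sketch keeps an annotation when enough systems agree on it and then scores one submission against the resulting set. It is a minimal sketch under stated assumptions, not the CALBC procedure: the `Annotation` record, the `harmonize` and `precision_recall` functions, the `min_votes` threshold, and the system names are all hypothetical, and agreement is simplified to an exact match on boundaries and concept id, whereas the harmonization described above also involves boundary reconciliation and semantic matching.

```python
from collections import defaultdict
from typing import NamedTuple


class Annotation(NamedTuple):
    """Hypothetical annotation record: a text span plus a concept id."""
    doc_id: str
    start: int        # character offset where the span begins
    end: int          # character offset where the span ends
    concept_id: str   # identifier of the annotated concept


def harmonize(system_outputs, min_votes=2):
    """Return the annotations proposed by at least `min_votes` systems.

    Assumption: two annotations agree only if document, boundaries, and
    concept id are identical. The threshold is illustrative.
    """
    votes = defaultdict(set)
    for system, annotations in system_outputs.items():
        for annotation in annotations:
            # A set of system names means each system votes at most once.
            votes[annotation].add(system)
    return {ann for ann, systems in votes.items() if len(systems) >= min_votes}


def precision_recall(submission, silver):
    """Score one system's annotations against the harmonized set."""
    submitted = set(submission)
    true_positives = len(submitted & silver)
    precision = true_positives / len(submitted) if submitted else 0.0
    recall = true_positives / len(silver) if silver else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up outputs from three hypothetical systems.
    outputs = {
        "system_a": [Annotation("doc1", 0, 5, "C1"), Annotation("doc1", 10, 15, "C2")],
        "system_b": [Annotation("doc1", 0, 5, "C1")],
        "system_c": [Annotation("doc1", 0, 5, "C1"), Annotation("doc1", 10, 15, "C3")],
    }
    silver = harmonize(outputs, min_votes=2)
    print(sorted(silver))
    print(precision_recall(outputs["system_a"], silver))
```

In this toy input, only the span that all three systems label with the same concept id survives the vote; the second span is dropped because the systems disagree on its concept id. A looser matching rule (for example, overlapping boundaries) would change which annotations count as agreeing.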

We found that species and disease annotations are better standardized among the partners than the annotations of genes and proteins. The raw corpus is now available for additional named entity annotations. Part of the annotated corpus will be made available later for a public challenge.

We expect to improve corpus-building activities both in the number of named entity classes covered and in the size of the corpus, measured in annotated documents.