The CALBC Silver Standard Corpus for Biomedical Named Entities - A Study in Harmonizing the Contributions from Four Independent Named Entity Taggers

Rebholz-Schuhmann, Dietrich; Jimeno-Yepes, Antonio José; van Mulligen, Erik; Kang, Ning; Kors, Jan; Milward, David; Corbett, Peter; Buyko, Ekaterina; Tomanek, Katrin; Beisswanger, Elena; Hahn, Udo

D. Rebholz-Schuhmann (Dietrich), A.J. Jimeno-Yepes (Antonio José), E.M. van Mulligen (Erik), N. Kang (Ning), J.A. Kors (Jan), D. Milward (David), P. Corbett (Peter), E. Buyko (Ekaterina), K. Tomanek (Katrin), E. Beisswanger (Elena), U. Hahn (Udo)

2010

Presented at the 7th International Conference on Language Resources and Evaluation, LREC 2010 (May 2010), Valletta

The production of gold standard corpora is time-consuming and costly. We propose an alternative: the 'silver standard corpus' (SSC), a corpus generated by harmonising the annotations delivered by a selection of annotation systems. The systems have to share the type system for the annotations, and the harmonisation solution has to use a suitable similarity measure for the pair-wise comparison of the annotations. The annotation systems have been evaluated against the harmonised set (630,324 sentences, 15,956,841 tokens). We demonstrate that the annotations of proteins and genes show higher diversity across all annotation solutions used, leading to lower agreement with the harmonised set than for the annotations of diseases and species. An analysis of the most frequent annotations from all systems shows that high agreement amongst systems leads to the selection of terms that are suitable to be kept in the harmonised set. This is the first large-scale approach to generating an annotated corpus from automated annotation systems. Further research is required to understand how the annotations from different systems have to be combined to produce the best annotation result for a harmonised corpus.
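
To make the harmonisation idea concrete, the following is a minimal sketch, not the procedure used in the paper. The abstract does not specify the similarity measure or the agreement rule, so everything here is an assumption chosen for illustration: annotations are modelled as character spans with a shared type, similarity is the Jaccard overlap of the character ranges, an annotation enters the harmonised set when at least `min_votes` systems agree on it, and each system is then scored against that set with an F1 value. The names `Span`, `similarity`, `harmonise`, and `f1_against` are hypothetical.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    type: str   # label from the shared type system, e.g. "disease"


def similarity(a: Span, b: Span) -> float:
    """Jaccard overlap of two character ranges; 0.0 if the types differ.

    Assumption: the paper only requires "a suitable similarity measure".
    """
    if a.type != b.type:
        return 0.0
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    union = max(a.end, b.end) - min(a.start, b.start)
    return overlap / union


def harmonise(systems: dict[str, list[Span]],
              min_votes: int = 2,
              threshold: float = 1.0) -> set[Span]:
    """Keep every span that at least `min_votes` systems agree on (assumed rule)."""
    kept: set[Span] = set()
    for name, spans in systems.items():
        for span in spans:
            # The proposing system counts as one vote; each other system
            # adds a vote if it has any sufficiently similar span.
            votes = 1 + sum(
                any(similarity(span, other) >= threshold for other in other_spans)
                for other_name, other_spans in systems.items()
                if other_name != name
            )
            if votes >= min_votes:
                kept.add(span)
    return kept


def f1_against(system_spans: list[Span],
               harmonised: set[Span],
               threshold: float = 1.0) -> float:
    """F1 of one system's output measured against the harmonised set."""
    if not system_spans or not harmonised:
        return 0.0
    matched_sys = sum(
        any(similarity(s, h) >= threshold for h in harmonised) for s in system_spans
    )
    matched_ref = sum(
        any(similarity(s, h) >= threshold for s in system_spans) for h in harmonised
    )
    precision = matched_sys / len(system_spans)
    recall = matched_ref / len(harmonised)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy input: four hypothetical taggers annotating the same sentence.
systems = {
    "A": [Span(0, 8, "disease"), Span(20, 24, "protein")],
    "B": [Span(0, 8, "disease"), Span(20, 26, "protein")],
    "C": [Span(0, 8, "disease")],
    "D": [Span(40, 45, "species")],
}
silver = harmonise(systems, min_votes=2, threshold=1.0)
scores = {name: f1_against(spans, silver) for name, spans in systems.items()}
```

With exact matching (`threshold=1.0`), only the disease span shared by three systems survives; the two protein spans disagree on their boundaries and are dropped. Lowering the threshold would admit both boundary variants, so a real harmonisation step would also need a rule for choosing one of them, which this sketch leaves open.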

Additional Metadata
Persistent URL	hdl.handle.net/1765/103402
Conference	7th International Conference on Language Resources and Evaluation, LREC 2010
Organisation	Erasmus MC: University Medical Center Rotterdam
Citation	Rebholz-Schuhmann, D., Jimeno-Yepes, A. J., van Mulligen, E., Kang, N., Kors, J., Milward, D., … Hahn, U. (2010). The CALBC Silver Standard Corpus for Biomedical Named Entities - A Study in Harmonizing the Contributions from Four Independent Named Entity Taggers. In Proceedings of the 7th International Conference on Language Resources and Evaluation, LREC 2010 (pp. 568–573). Retrieved from http://hdl.handle.net/1765/103402
