Comparing and combining chunkers of biomedical text

Kang, Ning; van Mulligen, Erik; Kors, Jan

doi:10.1016/j.jbi.2010.10.005

N. Kang (Ning), E.M. van Mulligen (Erik) and J.A. Kors (Jan)

2011-04-01

Journal of Biomedical Informatics, Volume 44, Issue 2, p. 354-360

Text chunking is an essential pre-processing step in information extraction systems. No comparative studies of chunking systems, including sentence splitting, tokenization and part-of-speech tagging, are available for the biomedical domain. We compared the usability (ease of integration, speed, trainability) and performance of six state-of-the-art chunkers for the biomedical domain, and combined the chunker results in order to improve chunking performance. We investigated six frequently used chunkers: GATE chunker, Genia Tagger, Lingpipe, MetaMap, OpenNLP, and Yamcha. All chunkers were integrated into the Unstructured Information Management Architecture framework. The GENIA Treebank corpus was used for training and testing. Performance was assessed for noun-phrase and verb-phrase chunking. For both noun-phrase chunking and verb-phrase chunking, OpenNLP performed best (F-scores 89.7% and 95.7%, respectively), but the differences from Genia Tagger and Yamcha were small. With respect to usability, Lingpipe and OpenNLP scored best. When the results of the chunkers were combined by a simple voting scheme, the F-score of the combined system improved by 3.1 percentage points for noun phrases and 0.6 percentage points for verb phrases compared to the best single chunker. Changing the voting threshold offered a simple way to obtain a system with high precision (and moderate recall) or high recall (and moderate precision). This study is the first to compare the performance of the whole chunking pipeline, and to combine different existing chunking systems. Several chunkers showed good performance, but OpenNLP scored best both in performance and usability. The combination of chunker results by a simple voting scheme can further improve performance and allows for different precision-recall settings.
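
The voting idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how votes are counted, so the following assumes that a chunk is a (start, end, phrase type) triple and that two chunkers agree only when they propose exactly the same triple. The names `vote` and `precision_recall_f` and the toy data are hypothetical, not taken from the paper.

```python
from collections import Counter
from typing import List, Set, Tuple

# Assumption: a chunk is (start_token, end_token, phrase_type).
Chunk = Tuple[int, int, str]


def vote(chunker_outputs: List[Set[Chunk]], threshold: int) -> Set[Chunk]:
    """Keep each chunk proposed identically by at least `threshold` chunkers."""
    # Each chunker's output is a set, so it casts at most one vote per chunk.
    counts = Counter(chunk for output in chunker_outputs for chunk in output)
    return {chunk for chunk, votes in counts.items() if votes >= threshold}


def precision_recall_f(predicted: Set[Chunk], gold: Set[Chunk]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and balanced F-score."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy data (hypothetical): three chunkers, one sentence, three gold chunks.
gold = {(0, 2, "NP"), (3, 4, "VP"), (5, 8, "NP")}
outputs = [
    {(0, 2, "NP"), (3, 4, "VP"), (5, 8, "NP")},
    {(0, 2, "NP"), (3, 4, "VP"), (5, 7, "NP")},
    {(0, 1, "NP"), (3, 4, "VP"), (5, 8, "NP")},
]

for threshold in (1, 2, 3):
    p, r, f = precision_recall_f(vote(outputs, threshold), gold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

In this toy example a threshold of 1 accepts every proposed chunk (full recall, lower precision), while a threshold of 3 keeps only the chunk all three chunkers agree on (full precision, lower recall), which mirrors the precision-recall trade-off the abstract attributes to the voting threshold.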

Additional Metadata
Keywords	Chunking systems, Evaluation, GENIA corpus, Natural language processing, Shallow parsing, Simple voting scheme
Persistent URL	doi.org/10.1016/j.jbi.2010.10.005, hdl.handle.net/1765/23894
Journal	Journal of Biomedical Informatics
Grant	This work was funded by the European Commission 7th Framework Programme; grant id fp7/231727 - Collaborative Annotation of a Large Biomedical Corpus (CALBC)
Organisation	Erasmus MC: University Medical Center Rotterdam
Citation	Kang, N., van Mulligen, E., & Kors, J. (2011). Comparing and combining chunkers of biomedical text. Journal of Biomedical Informatics, 44(2), 354–360. doi:10.1016/j.jbi.2010.10.005
