Automating classification of free-text electronic health records for epidemiological studies

Schuemie, Martijn; Sen, Elif; 't Jong, Geert; van Soest, E.M.; Sturkenboom, Miriam; Kors, Jan

doi:10.1002/pds.3205

M.J. Schuemie (Martijn), E.F. Sen (Elif), G.W. 't Jong (Geert), E.M. van Soest, M.C.J.M. Sturkenboom (Miriam) and J.A. Kors (Jan)

2012-06-01

Pharmacoepidemiology and Drug Safety: an international journal, Volume 21, Issue 6, p. 651-658

Purpose: Increasingly, patient information is stored in electronic medical records, which could be reused for research. Often these records comprise unstructured narrative data, which are cumbersome to analyze. The authors investigated whether text mining can make these data suitable for epidemiological studies, comparing a concept recognition approach with a range of machine learning techniques that require a manually annotated training set. The authors show how this training set can be created with minimal effort by using a broad database query.

Methods: The approaches were tested on two data sets: a publicly available set of English radiology reports to which International Classification of Diseases, Ninth Revision, Clinical Modification codes needed to be assigned, and a set of Dutch general practitioner (GP) records that needed to be classified as either liver disorder cases or noncases. Performance was tested against a manually created gold standard.

Results: The best overall performance was achieved by combining a manually created filter for removing negations and speculations with rule learning algorithms such as RIPPER. This combination scored highly on both the radiology reports (positive predictive value = 0.88, sensitivity = 0.85, specificity = 1.00) and the GP records (positive predictive value = 0.89, sensitivity = 0.91, specificity = 0.76).

Conclusions: Although a training set still needs to be created manually, text mining can help reduce the amount of manual work needed to incorporate narrative data in an epidemiological study and will make the data extraction more reproducible. An advantage of machine learning is that it is able to pick up specific language use, such as abbreviations and synonyms used by physicians.
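To make the reported measures concrete, the following is a minimal illustrative sketch in Python. It is not the authors' implementation. It assumes a very simple sentence-level filter that drops sentences containing a negation or speculation cue, followed by a single hand-written keyword rule standing in for a learned rule set, and then computes positive predictive value, sensitivity, and specificity against a gold standard. The cue list, the keywords, the function names, and the example records are all hypothetical; the abstract does not specify the actual filter or the rules RIPPER learned.

```python
import re

# Hypothetical cue words; the filter used in the paper is not given in the abstract.
NEGATION_CUES = ("no", "not", "without", "denies", "possible", "suspected")
CUE_PATTERN = re.compile(r"\b(" + "|".join(NEGATION_CUES) + r")\b", re.IGNORECASE)


def strip_negated_sentences(text: str) -> str:
    """Drop every sentence that contains a negation or speculation cue."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = [s for s in sentences if not CUE_PATTERN.search(s)]
    return " ".join(kept)


def classify(text: str, keywords=("hepatitis", "cirrhosis")) -> bool:
    """Toy one-rule classifier: a record is a case if a keyword survives the filter.

    The keywords are illustrative stand-ins for rules a learner might induce.
    """
    filtered = strip_negated_sentences(text).lower()
    return any(keyword in filtered for keyword in keywords)


def evaluate(predicted, gold):
    """Compute PPV, sensitivity, and specificity from paired boolean labels."""
    pairs = list(zip(predicted, gold))
    tp = sum(1 for p, g in pairs if p and g)
    fp = sum(1 for p, g in pairs if p and not g)
    fn = sum(1 for p, g in pairs if not p and g)
    tn = sum(1 for p, g in pairs if not p and not g)
    nan = float("nan")
    return {
        "ppv": tp / (tp + fp) if tp + fp else nan,          # TP / (TP + FP)
        "sensitivity": tp / (tp + fn) if tp + fn else nan,  # TP / (TP + FN)
        "specificity": tn / (tn + fp) if tn + fp else nan,  # TN / (TN + FP)
    }


# Invented example records paired with invented gold-standard labels.
RECORDS = [
    ("Patient has cirrhosis.", True),
    ("No signs of hepatitis.", False),
    ("Elevated ALT. Suspected hepatitis.", False),
    ("Known hepatitis B carrier.", True),
    ("Liver enzymes raised, jaundice.", True),
    ("Hepatitis vaccination given.", False),
]

if __name__ == "__main__":
    predictions = [classify(text) for text, _ in RECORDS]
    gold_labels = [label for _, label in RECORDS]
    print(evaluate(predictions, gold_labels))
```

In this toy data the keyword rule misses the record that describes a liver disorder without naming it and wrongly flags the vaccination record. Errors of this kind are what a training set and a learned rule set are meant to reduce.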

Additional Metadata
Keywords	Case definition, Free text, Machine learning, Method, Text mining
Persistent URL	doi.org/10.1002/pds.3205, hdl.handle.net/1765/66260
Journal	Pharmacoepidemiology and Drug Safety: an international journal
Grant	This work was funded by the European Commission 7th Framework Programme; grant id fp7/215847 - Exploring and understanding adverse drug reactions by integrative mining of clinical records and biomedical knowledge (EU-ADR)
Organisation	Erasmus MC: University Medical Center Rotterdam
Citation	Schuemie, M., Sen, E., 't Jong, G., van Soest, E. M., Sturkenboom, M., & Kors, J. (2012). Automating classification of free-text electronic health records for epidemiological studies. Pharmacoepidemiology and Drug Safety: an international journal, 21(6), 651–658. doi:10.1002/pds.3205
