Automatic identification of variables in epidemiological datasets using logic regression

Lorenz, Matthias W.; Abdi, Negin Ashtiani; Scheckenbach, Frank; Pflug, Anja; Bulbul, Alpaslan; Catapano, Alberico; Agewall, Stefan; Ezhov, Marat; Bots, Michiel; Kiechl, Stefan; Orth, Andreas; Norata, Giuseppe; Empana, Jean Philippe; Lin, Hung-Ju; McLachlan, Stela; Bokemark, Lena; Ronkainen, Kimmo; Amato, Mauro; Schminke, Ulf; Srinivasan, Sathanur R.; Lind, Lars; Kato, Akihiko; Dimitriadis, Chrystosomos; Przewlocki, Tadeusz; Okazaki, Shuhei; Stehouwer, Coen; Lazarevic, Tatjana; Willeit, Johann; Yanez, David N.; Steinmetz, Helmuth; Sander, Dirk; Poppert, Holger; Desvarieux, Moise; Ikram, Arfan; Bevc, Sebastjan; Staub, Daniel; Sirtori, Cesare R.; Iglseder, Bernhard; Engström, G.; Tripepi, Giovanni; Beloqui, Oscar; Lee, Moo-Sik; Friera, Alfonsa; Xie, Wuxiang; Grigore, Liliana; Plichart, Matthieu; Su, Ta-Chen; Robertson, Christine M; Schmidt, Caroline; Tuomainen, Tomi-Pekka; Veglia, Fabrizio; Völzke, Henry; Nijpels, Giel; Jovanovic, Aleksandar; Willeit, Johann; Sacco, Ralph L.; Franco, Oscar; Hojs, Radovan; Uthoff, Heiko; Hedblad, Bo; Park, Hyun Woong; Suarez, Carmen; Zhao, Dong; Catapano, Alberico; Ducimetiere, P.; Chien, Kuo-Liong; Price, Jackie F.; Bergstrom, Goran; Kauhanen, Jussi; Tremoli, Elena; Dörr, Marcus; Berenson, Gerald; Papagianni, Aikaterini; Kablak-Ziembicka, Anna; Kitagawa, Kazuo; Dekker, Jacqueline; Stolic, Radojica; Polak, Joseph F.; Sitzer, Matthias; Bickel, Horst; Rundek, Tatjana; Hofman, Albert; Ekart, Robert; Frauchiger, Beat; Castelnuovo, Samuela; Rosvall, Maria; Zoccali, Carmine; Landecho, Manuel F.; Bae, Jang-Ho; Gabriel, Rafael; Liu, Jing; Baldassarre, Damiano; Kavousi, Maryam

doi:10.1186/s12911-017-0429-1

Background: For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated identification of variables can help to reduce the workload and improve data quality. For semi-automation, high sensitivity in the recognition of matching variables is particularly important: it allows software to present, for each target variable, a choice of candidate source variables from which a user selects the matching one, with only a low risk that the correct source variable has been missed.

Methods: For each variable in a set of target variables, a number of simple rules were manually created. With logic regression, an optimal Boolean combination of these rules was searched for every target variable, using a random subset of a large database of epidemiological and clinical cohort data (construction subset). In a second subset of this database (validation subset), these optimal combinations of rules were validated.

Results: In the construction sample, the 41 target variables were allocated with an average positive predictive value (PPV) of 34% and an average negative predictive value (NPV) of 95%. In the validation sample, PPV was 33%, whereas NPV remained at 94%. PPV was 50% or less for 63% of all variables in the construction sample and for 71% of all variables in the validation sample.

Conclusions: We demonstrated that the application of logic regression to a complex data management task in large epidemiological IPD meta-analyses is feasible. However, the performance of the algorithm is poor, which may require backup strategies.
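As an illustration of the idea described under Methods, the following minimal sketch combines a few hand-written rules into Boolean expressions, picks the expression with the fewest misclassifications on a labelled toy sample, and reports PPV and NPV. It is not the authors' implementation: logic regression searches the space of Boolean expressions with a dedicated search algorithm, whereas this sketch simply enumerates a handful of two-rule expressions. The rule names, the toy variables and the target ("systolic blood pressure") are all hypothetical.

```python
from itertools import combinations

# Hypothetical simple rules for the target variable "systolic blood pressure".
# Each rule maps a source variable (name + label) to True or False.
RULES = {
    "name_has_sbp": lambda v: "sbp" in v["name"].lower(),
    "label_has_systolic": lambda v: "systolic" in v["label"].lower(),
    "label_has_diastolic": lambda v: "diastolic" in v["label"].lower(),
}

# Hypothetical labelled construction sample: "match" is the manual gold standard.
VARIABLES = [
    {"name": "SBP1", "label": "Systolic blood pressure", "match": True},
    {"name": "sysbp", "label": "systolic BP, mmHg", "match": True},
    {"name": "DBP1", "label": "Diastolic blood pressure", "match": False},
    {"name": "sbp_dbp_ratio", "label": "Systolic/diastolic ratio", "match": False},
    {"name": "age", "label": "Age at baseline", "match": False},
    {"name": "bp_sys", "label": "BP (sys.)", "match": True},
]


def candidate_expressions(rules):
    """Enumerate single rules and simple two-rule Boolean combinations."""
    exprs = list(rules.items())
    for (na, fa), (nb, fb) in combinations(rules.items(), 2):
        # Default arguments bind fa/fb now, avoiding late-binding closures.
        exprs.append((f"{na} AND {nb}", lambda v, fa=fa, fb=fb: fa(v) and fb(v)))
        exprs.append((f"{na} OR {nb}", lambda v, fa=fa, fb=fb: fa(v) or fb(v)))
        exprs.append((f"{na} AND NOT {nb}", lambda v, fa=fa, fb=fb: fa(v) and not fb(v)))
        exprs.append((f"{nb} AND NOT {na}", lambda v, fa=fa, fb=fb: fb(v) and not fa(v)))
    return exprs


def errors(fn, variables):
    """Number of variables where the expression disagrees with the gold standard."""
    return sum(fn(v) != v["match"] for v in variables)


def ppv_npv(predicted, actual):
    """Positive and negative predictive value; None if a denominator is zero."""
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)
    fp = sum(p and not a for p, a in pairs)
    tn = sum(not p and not a for p, a in pairs)
    fn = sum(not p and a for p, a in pairs)
    ppv = tp / (tp + fp) if tp + fp else None
    npv = tn / (tn + fn) if tn + fn else None
    return ppv, npv


# Pick the expression with the fewest misclassifications (ties: first found).
best_name, best_fn = min(
    candidate_expressions(RULES), key=lambda e: errors(e[1], VARIABLES)
)
best_errors = errors(best_fn, VARIABLES)
predicted = [best_fn(v) for v in VARIABLES]
ppv, npv = ppv_npv(predicted, [v["match"] for v in VARIABLES])
print(best_name, best_errors, ppv, npv)
```

In a setting like the one described above, the chosen expression would then be applied unchanged to a separate validation subset, and PPV and NPV recomputed there.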

Additional Metadata
Keywords	Data management, Epidemiology, Logic regression, Meta-analysis
Persistent URL	doi.org/10.1186/s12911-017-0429-1, hdl.handle.net/1765/99627
Journal	BMC Medical Informatics and Decision Making
Organisation	Erasmus MC: University Medical Center Rotterdam
Citation	Lorenz, M. W., Abdi, N.A. (Negin Ashtiani), Scheckenbach, F., Pflug, A., Bulbul, A., Catapano, A., … Kavousi, M. (2017). Automatic identification of variables in epidemiological datasets using logic regression. BMC Medical Informatics and Decision Making, 17(1). doi:10.1186/s12911-017-0429-1

