Supplementing claims data analysis using self-reported data to develop a probabilistic phenotype model for current smoking status

Reps, Jenna M.; Rijnbeek, Peter; Ryan, Patrick

doi:10.1016/j.jbi.2019.103264

Reps, J.M. (Jenna M.), P.R. Rijnbeek (Peter) and P.B. Ryan (Patrick)

2019-09-01

Journal of Biomedical Informatics, Volume 97

Objectives: Smoking status is poorly recorded in US claims data. IBM MarketScan Commercial is a claims database that can be linked to an additional health risk assessment with self-reported smoking status for a subset of 1,966,174 patients. We investigate whether this subset could be used to learn a smoking status phenotype model, generalizable to all US claims data, that calculates the probability of being a current smoker. Methods: Of the patients in this subset, 251,643 (12.8%) self-reported their smoking status as ‘current smoker’. A regularized logistic regression model, the Current Risk of Smoking Status (CROSS), was trained using the subset of patients with self-reported smoking status. CROSS considered 53,027 candidate covariates, including demographics and conditions/drugs/measurements/procedures/observations recorded in the prior 365 days. The CROSS phenotype model was validated across multiple other claims databases. Results: In the internal validation, the CROSS model achieved an area under the receiver operating characteristic curve (AUC) of 0.76, and the calibration plots indicated it was well calibrated. The external validation across three US claims databases obtained AUCs ranging between 0.82 and 0.87, suggesting the model is transportable across claims data. Conclusion: CROSS predicts current smoking status based on the claims records in the prior year. CROSS can be readily implemented on any US insurance claims data mapped to the OMOP common data model and will be a useful way to impute smoking status when conducting epidemiology studies where smoking is a known confounder but smoking status is not recorded. CROSS is available from https://github.com/OHDSI/StudyProtocolSandbox/tree/master/SmokingModel.
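
As an illustration of the general technique named in the abstract (a regularized logistic regression that outputs a probability, evaluated by AUC), the following is a minimal sketch in plain Python. It is not the CROSS implementation: the L2 penalty, the three binary covariates, the synthetic data generator, and all function names (`fit_logistic_l2`, `predict_proba`, `auc`, `make_synthetic`) are assumptions made for illustration, and the abstract does not state which penalty was used.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic_l2(X, y, lam=0.01, lr=1.0, epochs=300):
    """Batch gradient descent on mean log-loss plus an L2 penalty.

    The intercept is left unpenalized. The L2 penalty is an assumption;
    the abstract only says "regularized".
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n
        w = [wj - lr * (gw / n + lam * wj) for wj, gw in zip(w, grad_w)]
    return w, b


def predict_proba(X, w, b):
    # Predicted probability of the positive class for each row.
    return [sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]


def auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def make_synthetic(n, seed):
    # Hypothetical data: three binary covariates, one of them uninformative.
    rng = random.Random(seed)
    true_w, true_b = [1.5, -1.0, 0.0], -2.0
    X, y = [], []
    for _ in range(n):
        xi = [float(rng.random() < 0.3) for _ in true_w]
        p = sigmoid(true_b + sum(a * v for a, v in zip(true_w, xi)))
        X.append(xi)
        y.append(1 if rng.random() < p else 0)
    return X, y


# Train on one synthetic sample and evaluate discrimination on another.
X_train, y_train = make_synthetic(2000, seed=0)
X_test, y_test = make_synthetic(1000, seed=1)
w, b = fit_logistic_l2(X_train, y_train)
probs = predict_proba(X_test, w, b)
test_auc = auc(y_test, probs)
print(f"held-out AUC: {test_auc:.2f}")
```

In this sketch the held-out sample stands in for a validation database: the model is fitted once and then only applied, which mirrors the idea of imputing a smoking probability for patients whose status was never recorded.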

Additional Metadata
Keywords	Claims data, Imputation, Patient-level prediction, Probabilistic phenotype, Risk, Smoking
Persistent URL	doi.org/10.1016/j.jbi.2019.103264, hdl.handle.net/1765/119033
Journal	Journal of Biomedical Informatics
Organisation	Erasmus MC: University Medical Center Rotterdam
Citation	Reps, J.M. (Jenna M.), Rijnbeek, P., & Ryan, P. (2019). Supplementing claims data analysis using self-reported data to develop a probabilistic phenotype model for current smoking status. Journal of Biomedical Informatics, 97. doi:10.1016/j.jbi.2019.103264
