Objectives: Smoking status is poorly recorded in US claims data. IBM MarketScan Commercial is a claims database that can be linked to an additional health risk assessment with self-reported smoking status for a subset of 1,966,174 patients. We investigate whether this subset could be used to learn a smoking status phenotype model, generalizable to all US claims data, that calculates the probability of being a current smoker. Methods: Of the patients in the subset, 251,643 (12.8%) self-reported their smoking status as ‘current smoker’. A regularized logistic regression model, the Current Risk of Smoking Status (CROSS), was trained using the subset of patients with self-reported smoking status. CROSS considered 53,027 candidate covariates, including demographics and conditions/drugs/measurements/procedures/observations recorded in the prior 365 days. The CROSS phenotype model was validated across multiple other claims databases. Results: The internal validation showed the CROSS model achieved an area under the receiver operating characteristic curve (AUC) of 0.76, and the calibration plots indicated it was well calibrated. The external validation across three US claims databases obtained AUCs ranging between 0.82 and 0.87, suggesting the model is transportable across claims data. Conclusion: CROSS predicts current smoking status based on the claims records in the prior year. CROSS can be readily implemented on any US insurance claims data mapped to the OMOP common data model and will be a useful way to impute smoking status when conducting epidemiology studies where smoking is a known confounder but smoking status is not recorded. CROSS is available from https://github.com/OHDSI/StudyProtocolSandbox/tree/master/SmokingModel.
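
The modelling approach summarized in the abstract (a regularized logistic regression that outputs a probability, evaluated by AUC on data not used for training) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract says only that the model is "regularized", so the L1 penalty, the proximal gradient optimizer, the hyperparameters, and the synthetic covariates below are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(w, t):
    # Proximal operator of the L1 penalty: shrink w toward zero by t.
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0

def fit_l1_logistic(X, y, lam=0.01, lr=0.5, epochs=300):
    """Fit L1-regularized logistic regression by proximal gradient descent.

    lam, lr and epochs are illustrative defaults, not values from the paper.
    """
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(p):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n  # the intercept is not penalized
        w = [soft_threshold(wj - lr * gj / n, lr * lam)
             for wj, gj in zip(w, grad_w)]
    return w, b

def predict_proba(X, w, b):
    # Predicted probability of the positive class for each row.
    return [sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]

def auc(y_true, scores):
    # Rank-based (Mann-Whitney) AUC; tied scores count as half a win.
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def simulate(n, seed):
    # Synthetic binary covariates (hypothetical); only the first two carry signal.
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        x = [float(rng.random() < 0.3) for _ in range(6)]
        p = sigmoid(-2.0 + 2.0 * x[0] + 1.5 * x[1])
        X.append(x)
        y.append(1 if rng.random() < p else 0)
    return X, y

# Train on one simulated sample and evaluate discrimination on another.
X_train, y_train = simulate(2000, seed=1)
X_test, y_test = simulate(1000, seed=2)
w, b = fit_l1_logistic(X_train, y_train)
test_auc = auc(y_test, predict_proba(X_test, w, b))
print(f"held-out AUC: {test_auc:.2f}")
```

The predicted probabilities from such a model are what would be used for imputation; the simulated data here bear no relation to MarketScan or to the reported AUC values.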

Additional Metadata
Keywords Claims data, Imputation, Patient-level prediction, Probabilistic phenotype, Risk, Smoking
Persistent URL dx.doi.org/10.1016/j.jbi.2019.103264, hdl.handle.net/1765/119033
Journal Journal of Biomedical Informatics
Citation
Reps, J.M. (Jenna M.), Rijnbeek, P.R., & Ryan, P.B. (2019). Supplementing claims data analysis using self-reported data to develop a probabilistic phenotype model for current smoking status. Journal of Biomedical Informatics, 97. doi:10.1016/j.jbi.2019.103264