Improved high-dimensional prediction with Random Forests by the use of co-data

te Beest, Dennis E.; Mes, Steven; Wilting, Saskia; Brakenhoff, Ruud; van de Wiel, Mark

doi:10.1186/s12859-017-1993-1

te Beest, D.E. (Dennis E.), S.W. Mes (Steven), S.M. Wilting (Saskia), R. Brakenhoff (Ruud) and M.A. van de Wiel (Mark)

2017-12-28


BMC Bioinformatics, Volume 18, Issue 1

Background: Prediction in high-dimensional settings is difficult due to the large number of variables relative to the sample size. We demonstrate how auxiliary 'co-data' can be used to improve the performance of a Random Forest in such a setting. Results: Co-data are incorporated in the Random Forest by replacing the uniform sampling probabilities that are used to draw candidate variables with co-data moderated sampling probabilities. Co-data are defined here as any type of information that is available on the variables of the primary data but that does not use its response labels. Inspired by empirical Bayes, these moderated sampling probabilities are learned from the data at hand. We demonstrate the co-data moderated Random Forest (CoRF) with two examples. In the first example we aim to predict the presence of a lymph node metastasis with gene expression data. We demonstrate how a set of external p-values, a gene signature, and the correlation between gene expression and DNA copy number can improve the predictive performance. In the second example we demonstrate how the prediction of cervical (pre-)cancer with methylation data can be improved by including the location of the probe relative to the known CpG islands, the number of CpG sites targeted by a probe, and a set of p-values from a related study. Conclusion: The proposed method is able to utilize auxiliary co-data to improve the performance of a Random Forest.
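The core idea in the abstract, drawing candidate variables with non-uniform probabilities, can be illustrated with a minimal Python sketch. This is not the authors' CoRF implementation: the function names `moderated_probs` and `draw_candidates`, the toy p-values, and the rule that turns external p-values into weights (a normalised `-log10(p)`) are all assumptions made purely for illustration. In the paper the probabilities are learned from the data at hand rather than fixed by such a rule.

```python
import numpy as np


def moderated_probs(codata_score):
    """Normalise non-negative co-data scores into sampling probabilities.

    Hypothetical rule for illustration only; CoRF learns its
    probabilities from the data instead of using a fixed transform.
    """
    score = np.asarray(codata_score, dtype=float)
    if np.any(score < 0):
        raise ValueError("co-data scores must be non-negative")
    total = score.sum()
    if total == 0:
        # No co-data signal: fall back to the uniform probabilities
        # that a standard Random Forest uses.
        return np.full(score.size, 1.0 / score.size)
    return score / total


def draw_candidates(rng, probs, mtry):
    """Draw `mtry` distinct candidate variables for a single split."""
    return rng.choice(len(probs), size=mtry, replace=False, p=probs)


rng = np.random.default_rng(0)

# Toy external p-values for six variables (made-up numbers).
external_pvalues = np.array([0.001, 0.5, 0.9, 0.02, 0.7, 0.3])

# Assumed scoring: smaller external p-value gives a larger weight.
scores = -np.log10(external_pvalues)
probs = moderated_probs(scores)

candidates = draw_candidates(rng, probs, mtry=2)
print("sampling probabilities:", np.round(probs, 3))
print("candidate variables for this split:", candidates)
```

With uniform probabilities every variable is equally likely to be offered at a split; with moderated probabilities, variables that the co-data favour are offered more often, while the response labels of the primary data are never used to build the weights in this sketch.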

Additional Metadata
Keywords	Classification, DNA copy number, Gene expression, Methylation, Prior information, Random forest
Persistent URL	doi.org/10.1186/s12859-017-1993-1, hdl.handle.net/1765/103785
Journal	BMC Bioinformatics
Grant	This work was funded by the European Commission 7th Framework Programme; grant id erc/322986 - Molecular Self Screening for Cervical Cancer Prevention (MASS-CARE), This work was funded by the European Commission 7th Framework Programme; grant id h2020/689715 - Big Data and models for personalized Head and Neck Cancer Decision support (BD2Decide)
Organisation	Erasmus MC: University Medical Center Rotterdam
Citation	te Beest, D.E., Mes, S., Wilting, S., Brakenhoff, R., & van de Wiel, M. (2017). Improved high-dimensional prediction with Random Forests by the use of co-data. BMC Bioinformatics, 18(1). doi:10.1186/s12859-017-1993-1
