mlr : Avoiding data leakage in cross validation

Question

I am using mlr for my machine learning project. I am using 5-fold cross-validation repeated 5 times and a number of different algorithms. I am imputing the missing data using MICE (Multivariate Imputation by Chained Equations). I also need to standardize the numerical data.

Everything I have read says that, to avoid data leakage, I must perform any data-dependent steps, such as standardization, within the cross-validation loop. But how can I achieve this in mlr when, for example, the normalizeFeatures function is applied to the whole task?

This is what I have (the imputation with mice is not shown as that is done prior to calling this code - perhaps incorrectly):

surv.task <- makeSurvTask(id = task_id, data = dataset, target = c(time_var, status_var))
surv.task <- normalizeFeatures(surv.task)
surv.task <- createDummyFeatures(surv.task)
surv.measures = list(cindex)

ridge.lrn  <-  makeLearner(cl="surv.cvglmnet", id = "ridge", predict.type="response", alpha = 0, nfolds=5)
cboostcv.lrn <- makeLearner(cl="surv.cv.CoxBoost", id = "CoxBoostCV", predict.type="response")

outer = makeResampleDesc("RepCV", reps=num_iters, folds=5, stratify=TRUE)
learners = list(ridge.lrn, cboostcv.lrn)
bmr = benchmark(learners, surv.task, outer, surv.measures, show.info = TRUE)

How can I call normalizeFeatures (or do imputation) within the cross-validation loop?

jakob-r · Accepted Answer

This is what the wrappers in mlr are for. Alternatively, you can use the mlrCPO package, which provides preprocessing pipelines that are resampled together with the learner. You define a pipeline using the mlrCPO composition operator %>>%. Every CPO you put before the learner is fitted on the training data only, after the train/test split, and the fitted transformation is then applied to the test data at prediction time.

library(mlrCPO)

surv.task <- mlr::lung.task
surv.measures = list(cindex)

ridge.lrn  <-  makeLearner(cl="surv.cvglmnet", id = "ridge", predict.type="response", alpha = 0, nfolds=5)
cboostcv.lrn <- makeLearner(cl="surv.cv.CoxBoost", id = "CoxBoostCV", predict.type="response")

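# Scaling and dummy encoding are attached to each learner, so they are
# refitted on the training part of every resampling iteration
my_pipeline <- cpoScale() %>>% cpoDummyEncode()
ridge.lrn <- my_pipeline %>>% ridge.lrn
cboostcv.lrn <- my_pipeline %>>% cboostcv.lrn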

outer = makeResampleDesc("RepCV", reps=2, folds=5, stratify=TRUE)
learners = list(ridge.lrn, cboostcv.lrn)
bmr = benchmark(learners, surv.task, outer, surv.measures, show.info = TRUE)
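
The wrapper route mentioned at the start of this answer could look roughly like the sketch below. It is not part of the original answer and rests on a few assumptions: the caret package is installed (makePreprocWrapperCaret needs it), median/mode imputation is used as a simple stand-in because running MICE inside a wrapper is not shown here, and the names ridge.wrapped and res are only illustrative. The outermost wrapper runs first, so the data are imputed, then scaled, then dummy encoded.

```r
library(mlr)

# Same base learner as above
ridge.lrn <- makeLearner(cl = "surv.cvglmnet", id = "ridge",
                         predict.type = "response", alpha = 0, nfolds = 5)

# Innermost step: dummy encoding (applied last, right before the learner)
ridge.wrapped <- makeDummyFeaturesWrapper(ridge.lrn)

# Middle step: centering and scaling via caret::preProcess
ridge.wrapped <- makePreprocWrapperCaret(ridge.wrapped,
                                         ppc.center = TRUE, ppc.scale = TRUE)

# Outermost step: imputation (applied first); a stand-in for MICE
ridge.wrapped <- makeImputeWrapper(
  ridge.wrapped,
  classes = list(numeric = imputeMedian(),
                 integer = imputeMedian(),
                 factor  = imputeMode())
)

# Every wrapper is refitted on the training part of each fold
outer <- makeResampleDesc("RepCV", reps = 2, folds = 5, stratify = TRUE)
res <- resample(ridge.wrapped, lung.task, outer, measures = list(cindex))
```

Because the preprocessing is part of the learner, the wrapped learner can be passed to benchmark() in the same way as the mlrCPO learners above.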
