Reputation: 871
I am using mlr for my machine learning project, with 5-fold cross-validation repeated 5 times and a number of different algorithms. I am imputing the missing data using MICE (Multivariate Imputation by Chained Equations). I also need to standardize the numerical data.
Everything I have read says that, to avoid data leakage, I must perform any data-dependent steps, such as standardization, within the cross-validation loop. But how can I achieve this in mlr when, for example, the normalizeFeatures function applies to the whole task?
This is what I have (the imputation with mice is not shown as that is done prior to calling this code - perhaps incorrectly):
surv.task <- makeSurvTask(id = task_id, data = dataset, target = c(time_var, status_var))
surv.task <- normalizeFeatures(surv.task)
surv.task <- createDummyFeatures(surv.task)
surv.measures = list(cindex)
ridge.lrn <- makeLearner(cl="surv.cvglmnet", id = "ridge", predict.type="response", alpha = 0, nfolds=5)
cboostcv.lrn <- makeLearner(cl="surv.cv.CoxBoost", id = "CoxBoostCV", predict.type="response")
outer = makeResampleDesc("RepCV", reps=num_iters, folds=5, stratify=TRUE)
learners = list(ridge.lrn, cboostcv.lrn)
bmr = benchmark(learners, surv.task, outer, surv.measures, show.info = TRUE)
How can I call normalizeFeatures (or do imputation) within the cross-validation loop?
Upvotes: 3
Views: 352
Reputation: 7282
This is what the wrappers in mlr are for. Alternatively, you can use the package mlrCPO, which provides preprocessing pipelines that can be resampled together with the learner.
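Basically, you define a pipeline using the mlrCPO composition operator %>>%. Every CPO you attach in front of the learner is fitted on the training data only, after the train/test split and directly before training; the same fitted transformation is then applied to the test data at prediction time.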
library(mlrCPO)
surv.task <- mlr::lung.task
surv.measures = list(cindex)
ridge.lrn <- makeLearner(cl="surv.cvglmnet", id = "ridge", predict.type="response", alpha = 0, nfolds=5)
cboostcv.lrn <- makeLearner(cl="surv.cv.CoxBoost", id = "CoxBoostCV", predict.type="response")
my_pipeline <- cpoScale() %>>% cpoDummyEncode()
ridge.lrn <- my_pipeline %>>% ridge.lrn
cboostcv.lrn <- my_pipeline %>>% cboostcv.lrn
outer = makeResampleDesc("RepCV", reps=2, folds=5, stratify=TRUE)
learners = list(ridge.lrn, cboostcv.lrn)
bmr = benchmark(learners, surv.task, outer, surv.measures, show.info = TRUE)
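For completeness, here is a rough sketch of the wrapper route in plain mlr, without mlrCPO. It assumes the caret and glmnet packages are installed, and it uses simple median/mode imputation as a stand-in: as far as I know mlr has no built-in MICE wrapper, so this is not equivalent to what you do with mice. The name ridge.wrapped is just illustrative. Note that the outermost wrapper runs first at training time, so the learner is wrapped "inside out".

```r
library(mlr)

surv.task <- mlr::lung.task

# Base learner, same settings as above
ridge.lrn <- makeLearner(cl = "surv.cvglmnet", id = "ridge",
                         predict.type = "response", alpha = 0, nfolds = 5)

# Innermost step (runs last): dummy-encode factor features
ridge.wrapped <- makeDummyFeaturesWrapper(ridge.lrn)

# Middle step: center and scale numeric features via caret::preProcess
ridge.wrapped <- makePreprocWrapperCaret(ridge.wrapped,
                                         ppc.center = TRUE, ppc.scale = TRUE)

# Outermost step (runs first): impute missing values.
# ASSUMPTION: median/mode imputation replaces MICE here for illustration only.
ridge.wrapped <- makeImputeWrapper(ridge.wrapped,
  classes = list(numeric = imputeMedian(),
                 integer = imputeMedian(),
                 factor  = imputeMode()))

# All three steps are now re-fitted on the training part of every fold
rdesc <- makeResampleDesc("CV", iters = 5, stratify = TRUE)
res <- resample(ridge.wrapped, surv.task, rdesc, measures = list(cindex))
```

The wrapped learner can be passed to benchmark() in the same way as the CPO learners above. If you need MICE itself inside the loop, you would have to write a custom wrapper or CPO around it.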
Upvotes: 3