Ricky

Reputation: 4686

Repeated cross-validation on a subset of the data in mlr

I am trying to set up an mlr classification task where 75% of the data is to be used for training, and this 75% will be resampled by repeated cross validation.

My task setup is as follows:

pred.Bin.Task <- makeClassifTask(id="CountyCrime", data=df, target="count.bins")
preProc.Task <- normalizeFeatures(pred.Bin.Task, method="range")
rdesc <- makeResampleDesc("RepCV", reps=3, folds=5)
inTraining <- caret::createDataPartition(df$count.bins, p = .75, list = FALSE)

However, I can't get the resampling to work. When I run

lda.train <- resample("classif.lda", preProc.Task, rdesc, subset=inTraining)

I get the error

Error in setHyperPars2.Learner(learner, insert(par.vals, args)) : 
  classif.lda: Setting parameter subset without available description object!
You can switch off this check by using configureMlr!

Resampling without subsetting, i.e. lda.train <- resample("classif.lda", preProc.Task, rdesc), works.

I'd prefer to keep the whole dataset in the Task, not just the training data, so that I don't have to pre-process and resubmit new data when I predict on the holdout set. Any suggestions on how I can get the subsetting right?

Upvotes: 1

Views: 527

Answers (1)

Lars Kotthoff

Reputation: 109242

The error occurs because the resample function doesn't have a subset argument. Extra arguments are passed through to the learner as hyperparameters, and classif.lda does not have a subset parameter either.

mlr's resample descriptions don't let you keep part of the data completely separate (i.e. never used during training), which is what you're trying to do. However, you can use the subsetTask function to partition the already preprocessed task, so you don't have to preprocess again:

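# restrict the preprocessed task from the question to the training rows
preProc.Task.train = subsetTask(preProc.Task, inTraining)
# repeated CV now only ever sees the training rows
resample("classif.lda", preProc.Task.train, rdesc)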
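To connect this to the holdout prediction mentioned in the question, here is a minimal end-to-end sketch. It makes several assumptions that are not from the original post: it uses mlr's built-in iris.task in place of the county crime data, draws the 75% split with base R's sample() instead of caret::createDataPartition, and picks mmce as the performance measure. The names train.idx, test.idx and task.train are illustrative only.

```r
library(mlr)

set.seed(1)

# Assumption: iris.task stands in for the preprocessed task in the question.
task <- normalizeFeatures(iris.task, method = "range")

# Assumption: a plain random 75/25 split instead of caret::createDataPartition.
n <- getTaskSize(task)
train.idx <- sample(n, size = round(0.75 * n))
test.idx <- setdiff(seq_len(n), train.idx)

# Repeated cross-validation on the training rows only.
rdesc <- makeResampleDesc("RepCV", reps = 3, folds = 5)
task.train <- subsetTask(task, subset = train.idx)
res <- resample("classif.lda", task.train, rdesc)

# Unlike resample(), train() and predict() accept a subset argument,
# so the full task can be reused for the holdout rows.
model <- train("classif.lda", task, subset = train.idx)
pred <- predict(model, task = task, subset = test.idx)

# Misclassification rate on the holdout rows.
performance(pred, measures = mmce)
```

Because the holdout rows stay in the full task, they need no separate preprocessing step before prediction; only the row indices differ between the resampling and the final evaluation.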

Upvotes: 3
