motipai

Reputation: 328

How does caret split the data in trainControl?

I need to split the data for cross-validation sequentially, for example:

fold 1 contains the observations with indices 1 to 10, fold 2 those with indices 11 to 20, and so on.

Do any of the methods in caret's trainControl() split the data sequentially? I assume the "cv" method splits the data this way, but nothing in the caret documentation clearly guarantees it.

Upvotes: 1

Views: 470

Answers (1)

StupidWolf

Reputation: 47008

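The "cv" method assigns rows to folds at random, so it does not give you sequential folds by default. You can supply the folds yourself through the indexOut argument of trainControl(); see its help page. Below I use iris as an example. Because iris is ordered by Species, sequential folds on the raw data would each be dominated by a single class, so I shuffle the rows first: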

library(caret)
# shuffle the rows, because iris is sorted by Species
dat = iris[sample(nrow(iris)),]
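
To see why the shuffle matters for this particular dataset, here is a small base-R check of what 10 consecutive blocks of the unshuffled iris would contain. The names fold_id and composition are just mine for illustration:

```r
# iris ships with base R (datasets package) and is sorted by Species.
n <- nrow(iris)

# Assign rows to 10 consecutive blocks of 15 rows each (labels 0..9)
fold_id <- (seq_len(n) - 1) %/% (n / 10)

# Rows = sequential fold, columns = species: class counts per fold
composition <- table(fold = fold_id, species = iris$Species)
composition
```

On the unshuffled data the first three folds contain only setosa, so a model trained on the other folds would be evaluated on a single class each time.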

Next I create the folds. This is for 10-fold cross-validation, so each fold holds 1/10 of the rows:

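# block label 0..9 for each row: rows 1-15 get 0, rows 16-30 get 1, ...
idx = (seq_len(nrow(dat)) - 1) %/% (nrow(dat) / 10)
# list of row indices, one element per fold
Folds = split(seq_len(nrow(dat)), idx)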

We can look at the assignment of the indices:

Folds[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

Folds[[2]]
 [1] 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
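
The question asks for folds of a fixed size (rows 1 to 10, 11 to 20, and so on) rather than a fixed number of folds. A minimal base-R sketch of that variant, assuming a fold size of 10; fold_size and seq_folds are illustrative names, not anything caret requires:

```r
# Sequential folds of a fixed size, base R only.
n <- nrow(iris)      # 150 rows in the built-in iris data
fold_size <- 10      # assumed fold size, taken from the question
rows <- seq_len(n)

# ceiling(rows / fold_size) maps rows 1-10 to 1, rows 11-20 to 2, ...
fold_id <- ceiling(rows / fold_size)

# One list element per fold, each holding consecutive row indices
seq_folds <- split(rows, fold_id)
names(seq_folds) <- sprintf("Fold%02d", seq_along(seq_folds))

seq_folds[[1]]
length(seq_folds)
```

With 150 rows this gives 15 folds of 10 rows each; if the row count is not a multiple of fold_size, the last fold is simply shorter.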

Then pass the folds to train():

model = train(Species ~ ., method = "rf", data = dat,
              trControl = trainControl(method = "cv", indexOut = Folds))


model
Random Forest 

150 samples
  4 predictor
  3 classes: 'setosa', 'versicolor', 'virginica' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 135, 135, 135, 135, 135, 135, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
  2     1.0000000  1.0000000
  3     1.0000000  1.0000000
  4     0.9933333  0.9895833

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
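
One thing worth checking: as far as I understand caret, when only indexOut is supplied, train() still draws its own training sets (the index argument), so those may overlap with the supplied hold-out rows, which could explain a suspiciously high accuracy. A sketch that supplies both lists so that each model is fitted only on the rows outside its hold-out fold. TrainIdx and ctrl are my own names, and the caret call is guarded in case the package is not installed:

```r
# Build matching training / hold-out index lists for sequential 10-fold CV.
n <- nrow(iris)
rows <- seq_len(n)

# Hold-out rows: 10 consecutive blocks, same construction as in the answer
idx <- (rows - 1) %/% (n / 10)
Folds <- split(rows, idx)
names(Folds) <- paste0("Fold", seq_along(Folds))

# Training rows for each fold: every row that is not held out
TrainIdx <- lapply(Folds, function(held_out) setdiff(rows, held_out))

# Sanity check: no row is both trained on and held out in the same fold
overlap <- mapply(function(tr, te) length(intersect(tr, te)), TrainIdx, Folds)
stopifnot(all(overlap == 0))

if (requireNamespace("caret", quietly = TRUE)) {
  # index = rows used for fitting, indexOut = rows used for evaluation
  ctrl <- caret::trainControl(method = "cv",
                              index = TrainIdx, indexOut = Folds)
  # ctrl can then be passed on, e.g.
  # train(Species ~ ., data = dat, method = "rf", trControl = ctrl)
}
```

I have not rerun the model above with this control object, so the accuracy table shown earlier does not reflect it.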

Upvotes: 1
