Reputation: 96478
I have carefully read the caret documentation at http://caret.r-forge.r-project.org/training.html and the vignettes, and everything is quite clear (the examples on the website help a lot!), but I am still confused about the relationship between two arguments to trainControl, namely method and index, and about the interplay between trainControl and the data splitting functions in caret (e.g. createDataPartition, createResample, createFolds and createMultiFolds).
To better frame my questions, let me use the following example from the documentation:
library(caret)  # provides the BloodBrain data and the data splitting functions
data(BloodBrain)
set.seed(1)
tmp <- createDataPartition(logBBB, p = .8, times = 100)
trControl <- trainControl(method = "LGOCV", index = tmp)
ctreeFit <- train(bbbDescr, logBBB, "ctree", trControl = trControl)
My questions are:
1) If I use createDataPartition (which I assume does stratified bootstrapping), as in the above example, and I pass the result as index to trainControl, do I need to use LGOCV as the method in my call to trainControl? If I use another one (e.g. cv), what difference would it make? In my head, once you fix index, you are essentially choosing the type of cross-validation, so I am not sure what role method plays if you use index.
2) What is the difference between createDataPartition and createResample? Is it that createDataPartition does stratified bootstrapping, while createResample doesn't?
3) How can I do stratified k-fold (e.g. 10 fold) cross validation using caret? Would the following do it?
tmp <- createFolds(logBBB, k=10, list=TRUE, times = 100)
trControl = trainControl(method = "cv", index = tmp)
ctreeFit <- train(bbbDescr, logBBB, "ctree",trControl=trControl)
Upvotes: 15
Views: 5123
Reputation: 121626
If you are not sure what role method plays when you use index, why not apply all the methods and compare the results? It is a blind way of comparing, but it can give you some intuition.
methods <- c('boot', 'boot632', 'cv',
'repeatedcv', 'LOOCV', 'LGOCV')
I create my index:
n <- 100
tmp <- createDataPartition(logBBB,p = .8, times = n)
I apply trainControl to each method in my list, and I remove index from the result since it is common to all my methods.
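# Build one trainControl object per method, all sharing the same index
ll <- lapply(methods, function(x) trainControl(method = x, index = tmp))
# Drop the (identical) index element, then bind the remaining settings into a matrix
ll <- sapply(ll, '[<-', 'index', NULL)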
Hence my ll is:
                  [,1]      [,2]      [,3]      [,4]         [,5]      [,6]
method            "boot"    "boot632" "cv"      "repeatedcv" "LOOCV"   "LGOCV"
number            25        25        10        10           25        25
repeats           25        25        1         1            25        25
verboseIter       FALSE     FALSE     FALSE     FALSE        FALSE     FALSE
returnData        TRUE      TRUE      TRUE      TRUE         TRUE      TRUE
returnResamp      "final"   "final"   "final"   "final"      "final"   "final"
savePredictions   FALSE     FALSE     FALSE     FALSE        FALSE     FALSE
p                 0.75      0.75      0.75      0.75         0.75      0.75
classProbs        FALSE     FALSE     FALSE     FALSE        FALSE     FALSE
summaryFunction   ?         ?         ?         ?            ?         ?
selectionFunction "best"    "best"    "best"    "best"       "best"    "best"
preProcOptions    List,3    List,3    List,3    List,3       List,3    List,3
custom            NULL      NULL      NULL      NULL         NULL      NULL
timingSamps       0         0         0         0            0         0
predictionBounds  Logical,2 Logical,2 Logical,2 Logical,2    Logical,2 Logical,2
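In the same blind-comparison spirit, you can also look directly at the index lists that the data splitting functions return. The following is only a rough sketch, not part of the comparison above: the object names (part, boot, folds, has_dups, ctrl_cv, ctrl_lgocv) are my own, and I use times = 3 just to keep the output short. It checks how many training rows each resample contains, whether a resample repeats rows (which is what sampling with replacement would produce), and whether the index stored by trainControl changes with method.

```r
library(caret)
data(BloodBrain)
set.seed(1)

# Three ways of building training-row indices for the same outcome
part  <- createDataPartition(logBBB, p = 0.8, times = 3)
boot  <- createResample(logBBB, times = 3)
folds <- createFolds(logBBB, k = 10, returnTrain = TRUE)  # training rows, not held-out rows

# Number of training rows in each resample
sapply(part, length)
sapply(boot, length)
sapply(folds, length)

# Does a resample contain the same row more than once?
has_dups <- function(idx) any(duplicated(idx))
sapply(part, has_dups)
sapply(boot, has_dups)

# Is the user-supplied index stored differently depending on method?
ctrl_cv    <- trainControl(method = "cv", index = part)
ctrl_lgocv <- trainControl(method = "LGOCV", index = part)
identical(ctrl_cv$index, ctrl_lgocv$index)
```

Note that createFolds returns the held-out rows by default, so returnTrain = TRUE is set here to get training rows, which is what the index argument of trainControl expects.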
Upvotes: 1