Reputation: 1035
caret lets you specify a custom training and validation split in train via the trainControl options index and indexOut. However, when the fitted model is applied to the validation set and its performance is measured, the result is very different from the performance reported by the train object itself:
library(caret)
library(Metrics)
set.seed(123)
index_on <- 1:16
index_out <- 17:32
fit <- train(mpg ~ wt + qsec,
             mtcars,
             method = "glm",
             metric = "RMSE",
             trControl = trainControl(method = "cv",
                                      index = list(index_on),
                                      indexOut = list(index_out)))
fit$results$RMSE
rmse(mtcars[index_out, "mpg"], predict(fit, mtcars[index_out,]))
As you can see, the performance value stored in the train object differs from the one calculated directly with predict:
[1] 3.612743
[1] 3.079445
Is this a bug? Am I missing something here?
Upvotes: 3
Views: 597
Reputation: 1035
I have been investigating, and it looks like train internally fits the expected model and computes the performance with it, but then returns a different model: one trained on ALL the data (not only the "index" rows).
You can see that with this code:
set.seed(123)
fit_3 <- train(mpg ~ wt + qsec,
               data = mtcars,
               method = "glm",
               metric = "RMSE",
               trControl = trainControl(method = "none"))
rmse(mtcars[index_out, "mpg"], predict(fit_3, mtcars[index_out,]))
which produces:
[1] 3.079445
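The complementary check also holds: fitting the same glm by hand on only the index rows should reproduce the RMSE that train reports. This is a minimal sketch of my own (manual_fit is just an illustrative name, not part of caret's API):

# Fit on the "index" rows only, then score on the held-out rows;
# this should match fit$results$RMSE (3.612743 above).
manual_fit <- glm(mpg ~ wt + qsec, data = mtcars[index_on, ])
rmse(mtcars[index_out, "mpg"], predict(manual_fit, mtcars[index_out, ]))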
I'm using the latest caret version (caret_6.0-75 at the moment). It seemed clear that this is a bug, and I was about to report it when I found it is already an open issue:
https://github.com/topepo/caret/issues/348
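Until it is fixed, one way to recover the held-out performance that train actually computed is to keep the resampling predictions with the savePredictions option of trainControl. A sketch assuming the same split as above (fit_2 is my own name):

set.seed(123)
fit_2 <- train(mpg ~ wt + qsec,
               mtcars,
               method = "glm",
               metric = "RMSE",
               trControl = trainControl(method = "cv",
                                        index = list(index_on),
                                        indexOut = list(index_out),
                                        savePredictions = "final"))
# fit_2$pred holds the predictions made on the indexOut rows with the
# index-only model, so this should match fit_2$results$RMSE:
rmse(fit_2$pred$obs, fit_2$pred$pred)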
Upvotes: 1