Reputation: 900
I have built a glm model using the R package "caret" and would like to assess its performance using RMSE. The two RMSE values computed at the end of the code below differ (21.89 vs. 32.16). Which one is the real RMSE?
Also, how can I extract, for each of the 25 resamples (5 folds x 5 repeats), the training data, the test data, and the predictions (based on the optimal tuning parameter) from the model?
library(caret)
data("mtcars")
set.seed(100)
mydata = mtcars[, -c(8,9)]
model_glm <- train(
  hp ~ .,
  data = mydata,
  method = "glm",
  metric = "RMSE",
  preProcess = c('center', 'scale'),
  trControl = trainControl(
    method = "repeatedcv",
    number = 5,
    repeats = 5,
    verboseIter = TRUE
  )
)
GLM.pred = predict(model_glm, subset(mydata, select = -hp))
RMSE(pred = GLM.pred, obs = mydata$hp) # 21.89
model_glm$results$RMSE # 32.16
Upvotes: 0
Views: 423
Reputation: 2213
Computing the RMSE by hand on the full dataset, I get:
sqrt(mean((mydata$hp - predict(model_glm)) ^ 2))
[1] 21.89127
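This matches "RMSE(pred = GLM.pred, obs = mydata$hp)", so 21.89 is the RMSE of the final model on the full dataset it was trained on.
The per-resample values are stored in the model object: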
model_glm$resample$RMSE
[1] 28.30254 34.69966 25.55273 25.29981 40.78493 31.91056 25.05311 41.83223 26.68105 23.64629 27.98388 25.98827 45.26982 37.28214
[15] 38.13617 31.14513 23.35353 42.05274 34.04761 35.17733 28.28838 35.89639 21.42580 45.17860 29.13998
These are the RMSEs for the 25 cross-validation resamples (5 folds x 5 repeats), each computed on the held-out fold. Their mean is:
mean(model_glm$resample$RMSE)
[1] 32.16515
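So 32.16 is the average of the 25 resample RMSEs, each computed on data held out from fitting, and is what caret reports in model_glm$results$RMSE. The 21.89 is the RMSE of the final model on the original dataset, that is, on the same rows it was fit to, which typically makes it lower than the cross-validated value.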
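As for extracting the training rows, held-out rows, and predictions for each of the 25 resamples, here is a minimal sketch rather than a definitive recipe. It assumes the model is refit with savePredictions = "final" in trainControl, which the original call does not set. It reads the training row indices from model_glm$control$index and the held-out predictions from model_glm$pred. The names train_rows, test_rows, first, and rmse_by_fold are made up for illustration.

```r
library(caret)
data("mtcars")

mydata <- mtcars[, -c(8, 9)]

set.seed(100)
model_glm <- train(
  hp ~ .,
  data = mydata,
  method = "glm",
  metric = "RMSE",
  preProcess = c("center", "scale"),
  trControl = trainControl(
    method = "repeatedcv",
    number = 5,
    repeats = 5,
    savePredictions = "final"  # assumption: added so held-out predictions are kept
  )
)

# Row indices used for fitting in each resample (a named list of length 25)
train_rows <- model_glm$control$index

# Held-out rows are the complement of the training rows in each resample
test_rows <- lapply(train_rows, function(idx) setdiff(seq_len(nrow(mydata)), idx))

# Training and test data for one resample, selected by name
first <- names(train_rows)[1]
train_data <- mydata[train_rows[[first]], ]
test_data  <- mydata[test_rows[[first]], ]

# Held-out predictions for all resamples (columns include pred, obs, rowIndex, Resample)
preds <- model_glm$pred
fold_preds <- preds[preds$Resample == first, ]

# Recompute the RMSE of each resample from the saved predictions
rmse_by_fold <- sapply(
  split(preds, preds$Resample),
  function(d) sqrt(mean((d$obs - d$pred)^2))
)
mean(rmse_by_fold)  # should agree with model_glm$results$RMSE
```

The rowIndex column in model_glm$pred refers to rows of mydata, so mydata[fold_preds$rowIndex, ] gives the predictors behind each held-out prediction. Recomputing the per-resample RMSE this way is a quick check that the cross-validated figure comes only from held-out rows.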
Upvotes: 2