I am fitting a penalized logistic regression, training the model with caret (method glmnet):
model_fit <- train(Data[, -1], Data[, 1],
                   method = "glmnet",
                   family = "binomial",
                   metric = "ROC",
                   maximize = TRUE,  # logical, not the string "TRUE"
                   trControl = ctrl,
                   preProc = c("center", "scale"),
                   tuneGrid = expand.grid(.alpha = 0.5, .lambda = lambdaSeq))
According to the caret documentation, the function train "[...] calculates a resampling based performance measure" and "Across each data set, the performance of held-out samples is calculated and the mean and standard deviation is summarized for each combination."
The results component is "A data frame" (containing) "the training error rate and values of the tuning parameters."
Is model_fit$results$ROC a vector (with size equal to the size of my tuning parameter lambda) of the mean of the performance measure across resampling? (And NOT the performance measure computed over the whole sample after re-estimating the model over the whole sample for each value of lambda?)
Upvotes: 0
Is model_fit$results$ROC a vector (with size equal to the size of my tuning parameter lambda) of the mean of the performance measure across resampling?
It is. To be precise, its length equals the number of rows of your tuneGrid, which here happens to coincide with the length of your lambdaSeq (since the only other parameter, alpha, is held constant).
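The row-count point can be checked in base R without fitting anything. This is only a sketch: the lambdaSeq values below are made up for illustration (the question does not show them), and grid and grid2 are hypothetical names.

```r
# Hypothetical lambda sequence; the question does not show the real one
lambdaSeq <- 10^seq(-3, 0, length.out = 20)

# alpha held constant: one grid row per lambda value
grid <- expand.grid(alpha = 0.5, lambda = lambdaSeq)
nrow(grid) == length(lambdaSeq)

# Several alpha values: one row per (alpha, lambda) combination, so the
# results vector would be longer than lambdaSeq
grid2 <- expand.grid(alpha = c(0, 0.5, 1), lambda = lambdaSeq)
nrow(grid2)
```

With more than one alpha value, model_fit$results would have one row per (alpha, lambda) pair, so its ROC column would no longer line up one-to-one with lambdaSeq.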
Here is a quick example adapted from the caret docs (it uses gbm and the Accuracy metric, but the idea is the same):
library(caret)
library(mlbench)
data(Sonar)
set.seed(998)
inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training <- Sonar[ inTraining,]
testing <- Sonar[-inTraining,]
fitControl <- trainControl(method = "cv",
                           number = 5)
set.seed(825)
gbmGrid <- expand.grid(interaction.depth = 3,
                       n.trees = (1:3)*50,
                       shrinkage = 0.1,
                       n.minobsinnode = 20)
gbmFit1 <- train(Class ~ ., data = training,
                 method = "gbm",
                 trControl = fitControl,
                 tuneGrid = gbmGrid,
                 ## This last option is actually one
                 ## for gbm() that passes through
                 verbose = FALSE)
Here, gbmGrid has 3 rows, i.e. it consists of three different values of n.trees with the other parameters held constant; hence, the corresponding gbmFit1$results$Accuracy will be a vector of length 3:
gbmGrid
#   interaction.depth n.trees shrinkage n.minobsinnode
# 1                 3      50       0.1             20
# 2                 3     100       0.1             20
# 3                 3     150       0.1             20
gbmFit1$results
#   shrinkage interaction.depth n.minobsinnode n.trees  Accuracy     Kappa AccuracySD   KappaSD
# 1       0.1                 3             20      50 0.7450672 0.4862194 0.05960941 0.1160537
# 2       0.1                 3             20     100 0.7829704 0.5623801 0.05364031 0.1085451
# 3       0.1                 3             20     150 0.7765188 0.5498957 0.05263735 0.1061387
gbmFit1$results$Accuracy
# [1] 0.7450672 0.7829704 0.7765188
Each of the 3 Accuracy values returned comes from the validation folds of the 5-fold cross-validation used here as the resampling technique; more precisely, it is the mean of the validation accuracies computed over these 5 folds (the AccuracySD column holds the corresponding standard deviation).
And NOT the performance measure computed over the whole sample after re-estimating the model over the whole sample for each value of lambda?
Correct, it is not that.
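Both points can be checked by hand. The sketch below refits the gbm example with returnResamp = "all" in trainControl, so that the per-fold metrics are kept for every row of the grid (by default only those of the selected row are kept), and then averages them manually. The names gbmFit, foldAcc, byHand, res and trainAcc are introduced here for illustration only, and it assumes the caret, mlbench and gbm packages are installed.

```r
library(caret)
library(mlbench)
data(Sonar)

set.seed(998)
inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training <- Sonar[inTraining, ]

# Keep the held-out metrics of every fold for every tuning-grid row
fitControl <- trainControl(method = "cv", number = 5, returnResamp = "all")

gbmGrid <- expand.grid(interaction.depth = 3,
                       n.trees = (1:3) * 50,
                       shrinkage = 0.1,
                       n.minobsinnode = 20)

set.seed(825)
gbmFit <- train(Class ~ ., data = training,
                method = "gbm",
                trControl = fitControl,
                tuneGrid = gbmGrid,
                verbose = FALSE)

# One row per (fold, n.trees) pair: 5 folds x 3 grid rows
foldAcc <- gbmFit$resample

# Average the held-out accuracies over the folds, per value of n.trees
byHand <- aggregate(Accuracy ~ n.trees, data = foldAcc, FUN = mean)
byHand <- byHand[order(byHand$n.trees), ]
res <- gbmFit$results[order(gbmFit$results$n.trees), ]

# The hand-computed fold means should match the reported column
all.equal(byHand$Accuracy, res$Accuracy)

# For contrast: accuracy of the final model (refit on all of `training`
# with the selected n.trees) on that same data. This resubstitution
# figure is a different quantity and does not appear in gbmFit$results.
trainAcc <- mean(predict(gbmFit, newdata = training) == training$Class)
trainAcc
```

The resubstitution accuracy is typically more optimistic than the cross-validated values, which is why train reports the resampled means instead.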
Upvotes: 2