Fra_Ve

Reputation: 1210

Resampling based performance measure in caret

I perform a penalized logistic regression and I train a model with caret (glmnet).

model_fit <- train(Data[,-1], Data[,1],
               method = "glmnet",
               family="binomial",
               metric = "ROC",
               maximize = TRUE,
               trControl = ctrl,
               preProc = c("center", "scale"),
               tuneGrid=expand.grid(.alpha=0.5,.lambda=lambdaSeq)
               )

According to the caret documentation, the function train "[...] calculates a resampling based performance measure" and "Across each data set, the performance of held-out samples is calculated and the mean and standard deviation is summarized for each combination."

The results element is described as "A data frame" (containing) "the training error rate and values of the tuning parameters."

Is model_fit$results$ROC a vector (with size equal to the size of my tuning parameter lambda) of the mean of the performance measure across resampling? (And NOT the performance measure computed over the whole sample after re-estimating the model over the whole sample for each value of lambda?)

Upvotes: 0

Views: 832

Answers (1)

desertnaut

Reputation: 60337

Is model_fit$results$ROC a vector (with size equal to the size of my tuning parameter lambda) of the mean of the performance measure across resampling?

It is; to be precise, the length will be equal to the number of rows of your tuneGrid, which here happens to coincide with the length of your lambdaSeq (since the only other parameter, alpha, is held constant).
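For a setup close to the one in the question, the length claim can be checked with a minimal sketch like the one below. The Sonar data, the 5-fold `ctrl` object, and the `lambdaSeq` values are assumptions made for illustration only; they are not taken from the question.

```r
library(caret)
library(mlbench)
data(Sonar)

# Assumed resampling setup; metric = "ROC" needs class probabilities
# and twoClassSummary as the summary function
ctrl <- trainControl(method = "cv",
                     number = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

# Hypothetical lambda sequence, chosen only for illustration
lambdaSeq <- 10^seq(-3, -1, length.out = 10)
glmnetGrid <- expand.grid(alpha = 0.5, lambda = lambdaSeq)

set.seed(825)
glmnetFit <- train(Class ~ ., data = Sonar,
                   method = "glmnet",
                   family = "binomial",
                   metric = "ROC",
                   trControl = ctrl,
                   preProc = c("center", "scale"),
                   tuneGrid = glmnetGrid)

# One resampled ROC value per row of the tuning grid
nrow(glmnetGrid)
length(glmnetFit$results$ROC)
```

Both calls should report the same number, because `results` holds one row per combination of tuning parameters.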

Here is a quick example, adapted from the caret docs (it uses gbm and the Accuracy metric, but the idea is the same):

library(caret)
library(mlbench)
data(Sonar)

set.seed(998)
inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training <- Sonar[ inTraining,]
testing  <- Sonar[-inTraining,]

fitControl <- trainControl(method = "cv",
                           number = 5)

set.seed(825)

gbmGrid <-  expand.grid(interaction.depth = 3, 
                        n.trees = (1:3)*50, 
                        shrinkage = 0.1,
                        n.minobsinnode = 20)

gbmFit1 <- train(Class ~ ., data = training, 
                 method = "gbm", 
                 trControl = fitControl,
                 tuneGrid = gbmGrid,
                 ## This last option is actually one
                 ## for gbm() that passes through
                 verbose = FALSE)

Here, gbmGrid has 3 rows, i.e. it consists of three (3) different values of n.trees with the other parameters held constant; hence, the corresponding gbmFit1$results$Accuracy will be a vector of length 3:

gbmGrid
#   interaction.depth n.trees shrinkage n.minobsinnode
# 1                 3      50       0.1             20
# 2                 3     100       0.1             20
# 3                 3     150       0.1             20

gbmFit1$results
#   shrinkage interaction.depth n.minobsinnode n.trees  Accuracy     Kappa AccuracySD   KappaSD
# 1       0.1                 3             20      50 0.7450672 0.4862194 0.05960941 0.1160537
# 2       0.1                 3             20     100 0.7829704 0.5623801 0.05364031 0.1085451
# 3       0.1                 3             20     150 0.7765188 0.5498957 0.05263735 0.1061387

gbmFit1$results$Accuracy
# [1] 0.7450672 0.7829704 0.7765188

Each of the 3 Accuracy values returned is computed on the validation folds of the 5-fold cross-validation we used as the resampling technique; more precisely, it is the mean of the validation accuracies computed on these 5 held-out folds (and the AccuracySD column contains the corresponding standard deviation across the folds).
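To see the per-fold values behind these means, one option is to refit with `returnResamp = "all"` in `trainControl`, which keeps the held-out metrics for every row of the tuning grid (by default only those of the selected model are kept). The following is a sketch of that check; the object names `fitControlAll`, `gbmFitAll`, `foldMeans`, and `check` are introduced here for illustration.

```r
library(caret)
library(mlbench)
data(Sonar)

set.seed(998)
inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training <- Sonar[inTraining, ]

# Keep the per-fold metrics for all tuning rows, not only the best one
fitControlAll <- trainControl(method = "cv",
                              number = 5,
                              returnResamp = "all")

gbmGrid <- expand.grid(interaction.depth = 3,
                       n.trees = (1:3) * 50,
                       shrinkage = 0.1,
                       n.minobsinnode = 20)

set.seed(825)
gbmFitAll <- train(Class ~ ., data = training,
                   method = "gbm",
                   trControl = fitControlAll,
                   tuneGrid = gbmGrid,
                   verbose = FALSE)

# One row per (fold, n.trees) pair: 5 folds x 3 grid rows
perFold <- gbmFitAll$resample

# Average the held-out accuracies by hand for each value of n.trees
foldMeans <- aggregate(Accuracy ~ n.trees, data = perFold, FUN = mean)

# Line the manual means up against what train() reports in $results
check <- merge(gbmFitAll$results[, c("n.trees", "Accuracy")],
               foldMeans,
               by = "n.trees",
               suffixes = c(".results", ".manual"))
check
all.equal(check$Accuracy.results, check$Accuracy.manual)
```

If the values in `results` are indeed the fold means, the two Accuracy columns in `check` agree up to numerical tolerance.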

And NOT the performance measure computed over the whole sample after re-estimating the model over the whole sample for each value of lambda?

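Correct, it is not that. The values in results come only from the held-out resamples; the refit on the whole training set is done once, for the selected tuning values, and is stored in finalModel.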
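The difference between the two quantities can be made visible with a rough sketch: compare the resampled accuracy of the selected model with the "apparent" accuracy obtained by predicting on the same data the final model was fit on. The names `gbmFit`, `resampledAcc`, and `apparentAcc` are illustrative, and the exact numbers depend on the package versions and seeds.

```r
library(caret)
library(mlbench)
data(Sonar)

set.seed(998)
inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training <- Sonar[inTraining, ]

set.seed(825)
gbmFit <- train(Class ~ ., data = training,
                method = "gbm",
                trControl = trainControl(method = "cv", number = 5),
                tuneGrid = expand.grid(interaction.depth = 3,
                                       n.trees = (1:3) * 50,
                                       shrinkage = 0.1,
                                       n.minobsinnode = 20),
                verbose = FALSE)

# Mean held-out accuracy of the selected tuning row (what $results reports)
resampledAcc <- getTrainPerf(gbmFit)$TrainAccuracy

# Accuracy of the final model on the very data it was trained on
apparentAcc <- mean(predict(gbmFit, newdata = training) == training$Class)

c(resampled = resampledAcc, apparent = apparentAcc)
```

The apparent accuracy is typically the more optimistic of the two, which is why train reports the resampled one.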

Upvotes: 2
