Reputation: 1349
Say I am doing classification like below:
library(mlbench)
data(Sonar)
library(caret)
set.seed(998)
my_data <- Sonar
fitControl <- trainControl(
  method = "cv",
  number = 10,
  classProbs = TRUE,
  savePredictions = TRUE,
  summaryFunction = twoClassSummary
)

model <- train(
  Class ~ .,
  data = my_data,
  method = "xgbTree",
  trControl = fitControl,
  metric = "ROC"
)
For each of the 10 folds, 10% of the data is used for validation. For the best parameters that caret determines, I can use getTrainPerf(model) to get the mean validation AUC over all 10 folds, or model$resample to get the individual AUC value for each fold. However, I cannot get the AUC that results from predicting each fold's training data back with the model fitted on it. It would be nice if I could get that individual training-set AUC for every fold.
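For reference, these are the two summaries I am looking at now (the ROC column comes from the twoClassSummary setup above):
# mean validation ROC/Sens/Spec over the 10 folds for the best tune
getTrainPerf(model)
# individual ROC/Sens/Spec for each fold
model$resample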
How can this information be extracted? I want to make sure my model is not overfit (the data set I am working with is very small).
Thanks!
Upvotes: 1
Views: 2206
Reputation: 19716
As requested in the comments, here is a custom way to evaluate the per-fold training (resubstitution) AUC alongside the cross-validation test AUC. I am not sure whether the training AUC can be extracted directly from the caret train object.
After running caret::train, extract the fold assignment for the best tune:
library(tidyverse)

model$bestTune %>%
  left_join(model$pred) %>%
  select(rowIndex, Resample) %>%
  mutate(Resample = as.numeric(gsub(".*(\\d$)", "\\1", Resample)),
         Resample = ifelse(Resample == 0, 10, Resample)) %>%  # "Fold10" ends in 0, map it back to 10
  arrange(rowIndex) -> resamples
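A quick optional sanity check is to tabulate the fold assignment; each of the 10 folds should hold roughly a tenth of the rows:
table(resamples$Resample)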
Construct a cross validation function that will use the same folds as caret did:
library(xgboost)

train <- my_data[, !names(my_data) %in% "Class"]
label <- as.numeric(my_data$Class) - 1

test_auc <- lapply(1:10, function(x){
  # fit on the 9 training folds using the best tune found by caret
  # (`model` here is still the caret object fitted above)
  fold_fit <- xgboost(data = data.matrix(train[resamples[, 2] != x, ]),
                      label = label[resamples[, 2] != x],
                      nrounds = model$bestTune$nrounds,
                      max_depth = model$bestTune$max_depth,
                      gamma = model$bestTune$gamma,
                      colsample_bytree = model$bestTune$colsample_bytree,
                      objective = "binary:logistic",
                      eval_metric = "auc",
                      print_every_n = 50)
  # resubstitution predictions on the training folds and predictions on the held-out fold
  preds_train <- predict(fold_fit, data.matrix(train[resamples[, 2] != x, ]))
  preds_test  <- predict(fold_fit, data.matrix(train[resamples[, 2] == x, ]))
  auc_train <- pROC::auc(pROC::roc(response = label[resamples[, 2] != x],
                                   predictor = preds_train, levels = c(0, 1)))
  auc_test  <- pROC::auc(pROC::roc(response = label[resamples[, 2] == x],
                                   predictor = preds_test, levels = c(0, 1)))
  data.frame(fold = x, auc_train, auc_test)
})
do.call(rbind, test_auc)
#output
fold auc_train auc_test
1 1 1 0.9909091
2 2 1 0.9797980
3 3 1 0.9090909
4 4 1 0.9629630
5 5 1 0.9363636
6 6 1 0.9363636
7 7 1 0.9181818
8 8 1 0.9636364
9 9 1 0.9818182
10 10 1 0.8888889
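If you only want the two averages side by side, something like this works (cv_auc is just my name for the bound table above):
cv_auc <- do.call(rbind, test_auc)
mean(as.numeric(cv_auc$auc_train))  # resubstitution AUC, a perfect 1 here
mean(as.numeric(cv_auc$auc_test))   # mean held-out AUC, comparable to getTrainPerf(model)
For comparison, these are the per-fold values caret itself reports: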
arrange(model$resample, Resample)
#output
ROC Sens Spec Resample
1 0.9909091 1.0000000 0.8000000 Fold01
2 0.9898990 0.9090909 0.8888889 Fold02
3 0.9909091 0.9090909 1.0000000 Fold03
4 0.9444444 0.8333333 0.8888889 Fold04
5 0.9545455 0.9090909 0.8000000 Fold05
6 0.9272727 1.0000000 0.7000000 Fold06
7 0.9181818 0.9090909 0.9000000 Fold07
8 0.9454545 0.9090909 0.8000000 Fold08
9 0.9909091 0.9090909 0.9000000 Fold09
10 0.8888889 0.9090909 0.7777778 Fold10
Why the test-fold AUC from my function and from caret are not identical I cannot say; I am fairly sure the same parameters and folds were used, so I assume it has to do with the random seed. When I compute the AUC on caret's own held-out predictions I get the same output as caret:
model$bestTune %>%
  left_join(model$pred) %>%
  arrange(rowIndex) %>%
  select(M, Resample, obs) %>%  # M is the predicted probability of class "M"
  mutate(Resample = as.numeric(gsub(".*(\\d$)", "\\1", Resample)),
         Resample = ifelse(Resample == 0, 10, Resample),
         obs = as.numeric(obs) - 1) %>%
  group_by(Resample) %>%
  do(auc = as.vector(pROC::auc(pROC::roc(response = .$obs, predictor = .$M)))) %>%
  unnest()
#output
Resample auc
<dbl> <dbl>
1 1.00 0.991
2 2.00 0.990
3 3.00 0.991
4 4.00 0.944
5 5.00 0.955
6 6.00 0.927
7 7.00 0.918
8 8.00 0.945
9 9.00 0.991
10 10.0 0.889
But again, I emphasize that the training error will tell you little here (it is a perfect 1 in every fold) and that you should rely on the test (cross-validation) error. If you would like to bring the two closer together, consider tuning the gamma, alpha and lambda regularization parameters.
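For example, in the per-fold xgboost() call above the regularization parameters can be passed directly; the values below are only placeholders to show the syntax, not tuned settings:
fold_fit_reg <- xgboost(data = data.matrix(train[resamples[, 2] != 1, ]),
                        label = label[resamples[, 2] != 1],
                        nrounds = model$bestTune$nrounds,
                        max_depth = model$bestTune$max_depth,
                        gamma = 1,       # minimum loss reduction needed to split
                        alpha = 0.5,     # L1 regularization on leaf weights
                        lambda = 2,      # L2 regularization on leaf weights
                        objective = "binary:logistic",
                        eval_metric = "auc",
                        verbose = 0)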
With a small data set I would still try an 80:20 train:test split and use that independent test set to verify whether the CV error is close to the test error.
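A minimal sketch of that split with caret (the 80:20 proportion and the seed are arbitrary choices):
set.seed(998)
in_train <- createDataPartition(my_data$Class, p = 0.8, list = FALSE)
training <- my_data[in_train, ]
testing  <- my_data[-in_train, ]
# refit train() on `training` only, then check pROC::auc on the `testing` predictions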
Upvotes: 1