How can I use caret to train models and give the classification metrics over a validation set?

Question

I have a training set, a validation set and a test set. How can I train a model over a grid of tuning parameters in caret, but have the classification metrics computed on the validation set?

If I have the following syntax...

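library(caret)

TARGET <- iris$Species
trainX <- iris[, -5]

ctrl <- trainControl(method = "cv")

svm.tune <- train(x = trainX,
                  y = TARGET,
                  method = "svmRadial",
                  tuneLength = 9,
                  preProc = c("center", "scale"),
                  # "ROC" needs twoClassSummary and a two-class outcome;
                  # iris has three classes, so use Accuracy here
                  metric = "Accuracy",
                  trControl = ctrl)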

svm.tune

Is there a direct way to get the metrics over the validation set in the same form as the printout of svm.tune? Or do I have to call predict by hand for each candidate fit?

I'm new to caret's syntax. I know how to obtain the cross-validation metrics, but I would like to redirect the computations to the validation set. Which parameters should I use?

EDIT: Is there a way to show the classification metrics for each set of parameters of the grid using a validation set instead of cross-validation?

jamieRowen · Accepted Answer

You can do this by specifying the index and indexOut arguments to trainControl. I will use an example on the diamonds data from the ggplot2 package to illustrate.

library(caret)
data(diamonds, package = "ggplot2")
# create a mock training and validation set
training = diamonds[1:10000,]
validation = diamonds[10001:11000,]

Then use the createFolds function to create cross-validation folds for each model fit. By default (returnTrain = FALSE) it returns the held-out rows of each fold rather than the rows kept for fitting, hence its specification as TRUE here.

trainIndex = createFolds(training$price, returnTrain = TRUE)

Now we will create one data frame that contains both the training and validation sets, and create a list of hold-out indices with the same length as the number of training folds. Note that these indices are simply the rows of the combined data that make up the validation set.

dat = rbind(training,validation)
valIndex = lapply(trainIndex,function(i) 10001:11000)
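
Before passing the two lists on, it can be worth checking that they line up. The following is only a quick sanity-check sketch that rebuilds the same objects as above; the checks themselves are my own addition, not something caret requires.

```r
library(caret)
data(diamonds, package = "ggplot2")

# Same mock split as above: rows 1..10000 train, 10001..11000 validate
training   <- diamonds[1:10000, ]
trainIndex <- createFolds(training$price, returnTrain = TRUE)
valIndex   <- lapply(trainIndex, function(i) 10001:11000)

# One hold-out entry per training fold
same_length <- length(trainIndex) == length(valIndex)

# No fitting index may point into the validation rows
no_overlap <- all(vapply(trainIndex,
                         function(i) length(intersect(i, 10001:11000)) == 0,
                         logical(1)))

same_length
no_overlap
```

Both values should be TRUE; if either is not, the resampling results would not be a clean validation-set estimate.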

Then, when specifying the trainControl object, we pass these two lists to the arguments index and indexOut (the indices used to fit and to test, respectively) and train our model ("lm" here for speed).

trControl = trainControl(method = "cv",
                         index = trainIndex,
                         indexOut = valIndex)
train(price~., method = "lm", data = dat, trControl = trControl)
## Linear Regression 
##
## 11000 samples
##     9 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
##
## Summary of sample sizes: 8999, 8999, 9000, 9000, 8999, 9000, ... 
##
## Resampling results
##
##   RMSE      Rsquared   RMSE SD  Rsquared SD 
##   508.0062  0.9539221  2.54004  0.0002948073

You can convince yourself that this does what you intend in two ways. One is to keep all the resampling info and reproduce one of the resamples by fitting manually (you know the indices used for fitting, so you can do this). The other is simply to check that using only the training data gives different resampling results. Because the folds are fixed in advance, rerunning train introduces no extra randomness, so we would expect identical results if the validation set were not being used.

train(price~., method = "lm", data = training,trControl = trainControl(
  method = "cv", index = trainIndex
))
## Resampling results
##
##   RMSE      Rsquared   RMSE SD   Rsquared SD
##   337.6474  0.9074643  9.916053  0.008115761
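
The manual check can be sketched roughly as follows. This is an illustration under assumptions, not output I have recorded: the seed is arbitrary, `manual` and `manual_rmse` are names introduced here, and the two RMSE values should agree closely if indexOut is being honoured.

```r
library(caret)
data(diamonds, package = "ggplot2")

training   <- diamonds[1:10000, ]
validation <- diamonds[10001:11000, ]
dat        <- rbind(training, validation)

set.seed(42)  # arbitrary seed, only to make the folds reproducible
trainIndex <- createFolds(training$price, returnTrain = TRUE)
valIndex   <- lapply(trainIndex, function(i) 10001:11000)

fit <- train(price ~ ., data = dat, method = "lm",
             trControl = trainControl(method = "cv",
                                      index = trainIndex,
                                      indexOut = valIndex))

# Refit the first resample by hand and score it on the validation rows
manual      <- lm(price ~ ., data = dat[trainIndex[[1]], ])
manual_rmse <- RMSE(predict(manual, newdata = validation), validation$price)

# RMSE that caret stored for the same resample
caret_rmse <- fit$resample[fit$resample$Resample == names(trainIndex)[1], "RMSE"]

c(manual = manual_rmse, caret = caret_rmse)
```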

Hope that helps.

Edit:

OK, just noticed the OP asked about a classification example; the same approach holds for both regression and classification.
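
For a classification problem with a tuning grid, a minimal sketch might look like the one below. The 100/50 split of iris, the seed, and the choice of "knn" with this particular grid of k values are all assumptions made for illustration (knn is swapped in for svmRadial only so that no extra package is needed). The results element then holds one row of validation-set metrics per parameter value.

```r
library(caret)

set.seed(1)  # arbitrary seed for a reproducible mock split
idx        <- sample(nrow(iris))
training   <- iris[idx[1:100], ]
validation <- iris[idx[101:150], ]
dat        <- rbind(training, validation)  # validation is rows 101..150

# Fit on folds of the training rows, always evaluate on the validation rows
trainIndex <- createFolds(training$Species, returnTrain = TRUE)
valIndex   <- lapply(trainIndex, function(i) 101:150)

ctrl <- trainControl(method = "cv",
                     index = trainIndex,
                     indexOut = valIndex)

knn.tune <- train(Species ~ ., data = dat,
                  method = "knn",
                  tuneGrid = data.frame(k = c(3, 5, 7, 9)),
                  preProcess = c("center", "scale"),
                  trControl = ctrl)

# Accuracy and Kappa on the validation set for each value of k
knn.tune$results
```

One caveat to keep in mind: after tuning, train refits the final model on every row of the data it was given, so here the final model also sees the validation rows. If that matters, refit the chosen parameters on the training rows only.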
