Reputation: 333
I have a training set, a validation set and a test set. How can I train a model over different parameters (defined by a tuning grid in caret), but with the classification metrics calculated over the validation set?
If I have the following syntax...
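library(caret)
TARGET <- iris$Species
trainX <- iris[, -5]
ctrl <- trainControl(method = "cv")
# "ROC" needs a two-class outcome with classProbs = TRUE and twoClassSummary;
# iris has three classes, so Accuracy is used here
svm.tune <- train(x = trainX,
                  y = TARGET,
                  method = "svmRadial",
                  tuneLength = 9,
                  preProc = c("center", "scale"),
                  metric = "Accuracy",
                  trControl = ctrl)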
svm.tune
Is there a direct way to obtain the metrics over the validation set, in the same form as the printout of svm.tune? Or should I use 'predict' for each candidate fit by hand?
I'm new to caret's syntax. I know how to obtain the metrics for cross-validation, but I would like to redirect the computations to this validation set. Which parameters should I use?
EDIT: Is there a way to show the classification metrics for each set of parameters of the grid using a validation set instead of cross-validation?
Upvotes: 1
Views: 677
Reputation: 1549
You can do this by specifying the index and indexOut arguments to trainControl. I will use an example on the diamonds data from the ggplot2 package to illustrate.
library(caret)
data(diamonds, package = "ggplot2")
# create a mock training and validation set
training = diamonds[1:10000,]
validation = diamonds[10001:11000,]
Then use the createFolds function to create some cross-validation folds for each model fit. By default (returnTrain = FALSE) it returns the held-out rows of each fold rather than the rows kept for fitting, hence its specification as TRUE.
trainIndex = createFolds(training$price, returnTrain = TRUE)
Now we will create one data frame that contains both the training and validation sets, and create a list of hold-out indices with one entry per training fold. Note that these indices just correspond to the rows of the combined data that make up the validation set.
dat = rbind(training,validation)
valIndex = lapply(trainIndex,function(i) 10001:11000)
Then, in the specification of the trainControl object, we pass these two lists of indices to the arguments index and indexOut (the indices to fit and to test on, respectively) and train our model ("lm" here for speed).
trControl = trainControl(method = "cv",
                         index = trainIndex,
                         indexOut = valIndex)
train(price~., method = "lm", data = dat, trControl = trControl)
## Linear Regression
##
## 11000 samples
## 9 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 8999, 8999, 9000, 9000, 8999, 9000, ...
##
## Resampling results
##
## RMSE      Rsquared   RMSE SD  Rsquared SD
## 508.0062  0.9539221  2.54004  0.0002948073
You can convince yourself that this does what you intend in two ways. Either keep all the resampling information and check one resample by fitting it manually (you know the indices used for fitting, so you can do this), or simply note that using only the training data gives different resampling results. (Because the folds are fixed in advance, rerunning train introduces no randomness, so the results would be identical if the validation set were not being used.)
train(price ~ ., method = "lm", data = training,
      trControl = trainControl(method = "cv", index = trainIndex))
## Resampling results
##
## RMSE      Rsquared   RMSE SD   Rsquared SD
## 337.6474  0.9074643  9.916053  0.008115761
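The other check mentioned above, refitting one resample by hand, could be sketched roughly as follows. This is only an illustration under a few assumptions: the seed and the names fit, manual_fit, manual_rmse and caret_rmse are made up for this sketch, and it relies on train keeping the per-resample results in fit$resample (which I believe is the default).

```r
library(caret)
data(diamonds, package = "ggplot2")

# Same mock split as above
training   <- diamonds[1:10000, ]
validation <- diamonds[10001:11000, ]
dat        <- rbind(training, validation)

set.seed(1)  # arbitrary seed, only so the folds are reproducible
trainIndex <- createFolds(training$price, returnTrain = TRUE)
valIndex   <- lapply(trainIndex, function(i) 10001:11000)

fit <- train(price ~ ., method = "lm", data = dat,
             trControl = trainControl(method = "cv",
                                      index = trainIndex,
                                      indexOut = valIndex))

# Refit the first resample by hand and score it on the validation rows
manual_fit  <- lm(price ~ ., data = dat[trainIndex[[1]], ])
manual_rmse <- RMSE(predict(manual_fit, newdata = validation), validation$price)

# caret's own RMSE for the same resample
caret_rmse <- fit$resample$RMSE[fit$resample$Resample == names(trainIndex)[1]]

# Compare the two values (up to numerical tolerance)
all.equal(manual_rmse, caret_rmse)
```

If the two numbers agree, the hold-out metrics reported by train are indeed being computed on the validation rows.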
Hope that helps.
Edit:
OK, I just noticed the OP asked for a classification example; however, the answer holds for both regression and classification.
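For completeness, here is a rough sketch of the same idea for classification with a tuning grid, using iris. The 100/50 split, the seed, the choice of "knn" (picked only because it needs no extra package) and the grid values are all assumptions for illustration, not something from the question.

```r
library(caret)

set.seed(42)  # arbitrary seed for a reproducible mock split
# Mock split of iris: 100 rows for training, 50 for validation (assumed sizes)
shuffled   <- iris[sample(nrow(iris)), ]
training   <- shuffled[1:100, ]
validation <- shuffled[101:150, ]
dat        <- rbind(training, validation)

# Fit on folds of the training rows (1..100), always score on rows 101..150
trainIndex <- createFolds(training$Species, returnTrain = TRUE)
valIndex   <- lapply(trainIndex, function(i) 101:150)

ctrl <- trainControl(method = "cv", index = trainIndex, indexOut = valIndex)

knn.tune <- train(Species ~ ., data = dat,
                  method = "knn",
                  tuneGrid = data.frame(k = c(3, 5, 7)),
                  preProcess = c("center", "scale"),
                  trControl = ctrl)

# One row per grid value; Accuracy and Kappa are averaged over the resamples,
# each of which was scored on the validation rows
knn.tune$results
```

Printing knn.tune gives the usual per-parameter table, only with the metrics coming from the validation rows. One caveat, as far as I know: train refits the final model on every row of the data it was given, so here the validation rows are included in knn.tune$finalModel. Refit on the training rows alone if that matters for your later test-set evaluation.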
Upvotes: 2