Reputation: 855
As I understand it, cross_val_score, cross_val_predict, and cross_validate can all use K-fold cross-validation. This means that the training set is iteratively split so that each part serves in turn as training data and test data. However, I have not come across any information on how validation is taken care of. It appears that the data is not divided into three sets: training, validation, and test sets. How do cross_val_score, cross_val_predict, and cross_validate take care of training, validation, and testing?
Upvotes: 0
Views: 1166
Reputation: 733
cross_val_score is used to estimate a model's accuracy in a more robust way than with a single train-test split. It does the same job, but repeats it many times. These repetitions can be done in many different ways: CV, repeated CV, LOO, etc. See section 3.1.2 in the sklearn User Guide.
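For example, here is a minimal sketch of such a repeated estimate (the decision-tree estimator and iris data are illustrative assumptions, not part of the original answer):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# One call repeats the fit/score cycle once per fold (5-fold CV here)
# and returns one accuracy per repetition, instead of the single
# number a lone train-test split would give.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores, scores.mean())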
In case you need to cross-validate hyperparameters, you should run a nested cross-validation, with an outer loop to estimate the model's accuracy and an inner loop to get the best parameters. The inner CV loop splits the training set of the outer loop further into training and validation sets. The procedure goes something like this:
Outer loop:
    Split train - test
    Inner loop:
        Fix parameters
        Split train into train2 - validation
        Train with the train2 set
        Score with the validation set
        Repeat the inner loop for all parameters
    Train with the train set and the best parameters from the inner loop
    Score with the test set
    Repeat the outer loop until CV ends
Return the test scores
Fortunately, sklearn allows you to nest a GridSearchCV inside a cross_val_score:
validation = GridSearchCV(estimator, param_grid)  # inner loop: hyperparameter search
score = cross_val_score(validation, X, y)         # outer loop: performance estimate
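Spelled out as a runnable sketch (the SVC estimator, grid, and iris data below are illustrative assumptions, not prescribed by sklearn):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Inner loop: GridSearchCV performs its own train/validation splits
# on each outer training fold to pick the best hyperparameters.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}
validation = GridSearchCV(SVC(), param_grid, cv=3)

# Outer loop: cross_val_score estimates the accuracy of the whole
# search-and-refit procedure on held-out outer folds.
scores = cross_val_score(validation, X, y, cv=5)
print(scores.mean())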
Upvotes: 1
Reputation: 8933
cross_val_score does take care of validation, insofar as the process splits the dataset into K parts (5 by default in recent scikit-learn versions; 3 in older ones) and performs fitting and validation K times. The sklearn documentation talks about splitting the dataset into train/test sets, but do not misunderstand the name: that test set is in fact a validation set.
By using cross_val_score you can tune model hyperparameters and get the best configuration.
Therefore, the general procedure is: divide the dataset yourself into a training set and a test set. Use the training set for cross-validation (invoking cross_val_score) in order to tune the model's hyperparameters and find the best configuration. Then use the test set to evaluate the model. Note that the test set should be large enough and representative of the population in order to give an unbiased estimate of the generalization error. A sketch of this procedure follows.
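A minimal sketch of that procedure, assuming a logistic-regression estimator on the iris data (both are illustrative, not part of the answer):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)

# Hold out a test set yourself; cross_val_score never sees it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Tune a hyperparameter with cross-validation on the training set only.
# Each fold's "test" part is acting as a validation set here.
best_C, best_score = None, -1.0
for C in [0.01, 0.1, 1, 10]:
    score = cross_val_score(LogisticRegression(C=C, max_iter=1000),
                            X_train, y_train, cv=5).mean()
    if score > best_score:
        best_C, best_score = C, score

# Refit on the full training set with the best configuration,
# then evaluate once on the untouched test set.
model = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))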
Upvotes: 0