Gakuo

Reputation: 855

How do cross_val_score, cross_val_predict, and cross_validate take care of training, testing and validation?

As per my understanding, cross_val_score, cross_val_predict, and cross_validate can use K-fold cross-validation: the dataset is split so that each part is used in turn as a training set and a test set. However, I have not come across any information on how validation is handled. It appears that the data is not divided into three sets (training, validation and test). How do these functions take care of training, validation and testing?

Upvotes: 0

Views: 1166

Answers (2)

Pablo

Reputation: 733

cross_val_score estimates a model's accuracy in a more robust way than a single train-test split: it does the same job, but repeats it many times. These repetitions can be organized in different ways: K-fold CV, repeated CV, leave-one-out (LOO), etc. See section 3.1.2 of the sklearn User Guide.
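As a minimal sketch of the above (the dataset and estimator here are illustrative choices, not part of the answer), the same call to cross_val_score accepts either a fold count or a splitter object such as RepeatedKFold:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, RepeatedKFold

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# Plain K-fold CV: cv=5 splits the data into 5 train/test folds
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean(), scores.std())

# Repeated CV: the same K-fold splitting, repeated with different shuffles
rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
repeated_scores = cross_val_score(clf, X, y, cv=rkf)
print(repeated_scores.mean())
```

Each element of the returned array is the score on one held-out fold, so averaging them gives the robust estimate the answer describes.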

If you also need to cross-validate hyperparameters, you should run a nested cross-validation, with an outer loop to estimate the model's accuracy and an inner loop to pick the best parameters. The inner CV loop splits the outer loop's training set further into train and validation sets. The procedure goes something like:

Outer loop:
    Split train - test
    Inner loop:
        Fix parameters
        Split train into train2 - validation
        Train on train2
        Score on validation
        Repeat inner loop for all parameters
    Train on train set with best parameters from inner loop
    Score on test set
    Repeat outer loop until CV ends
Return test scores

Fortunately, sklearn allows nesting a GridSearchCV inside a cross_val_score:

from sklearn.model_selection import GridSearchCV, cross_val_score

validation = GridSearchCV(estimator, param_grid)
score = cross_val_score(validation, X, y)
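Filled in with a concrete estimator and parameter grid (both illustrative choices, not prescribed by the answer), the nested scheme looks like this:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Inner loop: GridSearchCV searches C over train/validation splits
param_grid = {"C": [0.1, 1, 10]}
validation = GridSearchCV(SVC(), param_grid, cv=3)

# Outer loop: cross_val_score estimates the accuracy of the tuned model
# on folds that the inner search never saw
scores = cross_val_score(validation, X, y, cv=5)
print(scores.mean())
```

Each outer fold refits the grid search from scratch on its own training portion, which is exactly the nested procedure sketched in the pseudocode above.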

Upvotes: 1

sentence

Reputation: 8933

cross_val_score does take care of validation insofar as it splits the dataset into K parts (5 by default in current scikit-learn; older versions used 3) and performs fitting and scoring K times. The sklearn documentation talks about splitting the dataset into train/test sets, but do not be misled by the name: that "test" set is in fact a validation set.

By using cross_val_score you can tune model hyperparameters and get the best configuration.

Therefore, the general procedure should be to divide (by yourself) the dataset into a training set and a test set.

Use the training set for cross-validation (invoking cross_val_score), in order to tune model hyperparameters and get the best configuration.

Then use the test set to evaluate the model. Note that the test set should be large enough and representative of the population in order to get an unbiased estimate of the generalization error.
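The procedure described above can be sketched as follows (the dataset, model, and candidate hyperparameter values are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_iris(return_X_y=True)

# Step 1: divide the dataset yourself into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Step 2: tune a hyperparameter using CV on the training set only
best_C, best_score = None, -1.0
for C in [0.01, 0.1, 1, 10]:
    score = cross_val_score(
        LogisticRegression(C=C, max_iter=1000),
        X_train, y_train, cv=5).mean()
    if score > best_score:
        best_C, best_score = C, score

# Step 3: evaluate the chosen configuration once on the untouched test set
final_model = LogisticRegression(C=best_C, max_iter=1000)
final_model.fit(X_train, y_train)
print(final_model.score(X_test, y_test))
```

The test set is never used during tuning, so the final score is an honest estimate of the generalization error.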

Upvotes: 0
