Reputation: 855
As I understand it, cross_val_score, cross_val_predict, and cross_validate can all use K-fold cross-validation. This means that the training set is iteratively split so that each part serves in turn as training data and test data. However, I have not come across any information on how validation is taken care of. It appears that the data is not divided into three sets: training, validation, and test sets. How do cross_val_score, cross_val_predict, and cross_validate take care of training, validation, and testing?
Upvotes: 0
Views: 1166
Reputation: 733
cross_val_score is used to estimate a model's accuracy in a more robust way than with a single train-test split. It does the same job, but repeats it many times. These repetitions can be done in many different ways: CV, repeated CV, LOO, etc. See section 3.1.2 in the sklearn User Guide.
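For example, here is a minimal sketch of such a repeated estimate (the decision-tree estimator and iris data are illustrative assumptions, not part of the original answer):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# One call repeats the fit/score cycle once per fold (5-fold CV here)
# and returns one accuracy per repetition, instead of the single
# number a lone train-test split would give.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores, scores.mean())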
In case you need to cross-validate hyperparameters, you should run a nested cross-validation, with an outer loop to estimate the model's accuracy and an inner loop to get the best parameters. The inner CV loop splits the training set of the outer loop further into training and validation sets. The procedure goes something like this:
Outer loop:
    Split train - test
    Inner loop:
        Fix parameters
        Split train into train2 - validation
        Train with the train2 set
        Score with the validation set
        Repeat the inner loop for all parameters
    Train with the train set and the best parameters from the inner loop
    Score with the test set
    Repeat the outer loop until CV ends
Return the test scores
Fortunately, sklearn allows you to nest a GridSearchCV inside a cross_val_score:
validation = GridSearchCV(estimator, param_grid)  # inner loop: hyperparameter search
score = cross_val_score(validation, X, y)         # outer loop: performance estimate
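Spelled out as a runnable sketch (the SVC estimator, grid, and iris data below are illustrative assumptions, not prescribed by sklearn):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Inner loop: GridSearchCV performs its own train/validation splits
# on each outer training fold to pick the best hyperparameters.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}
validation = GridSearchCV(SVC(), param_grid, cv=3)

# Outer loop: cross_val_score estimates the accuracy of the whole
# search-and-refit procedure on held-out outer folds.
scores = cross_val_score(validation, X, y, cv=5)
print(scores.mean())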
Upvotes: 1
Reputation: 8933
cross_val_score does take care of validation, insofar as the process splits the dataset into K parts (5 by default in recent scikit-learn versions; 3 in older ones) and performs fitting and validation K times. The sklearn documentation talks about splitting the dataset into train/test sets, but do not misunderstand the name: that test set is in fact a validation set.
By using cross_val_score you can tune model hyperparameters and get the best configuration.
Therefore, the general procedure is: divide the dataset yourself into a training set and a test set. Use the training set for cross-validation (invoking cross_val_score) in order to tune the model's hyperparameters and find the best configuration. Then use the test set to evaluate the model. Note that the test set should be large enough and representative of the population in order to give an unbiased estimate of the generalization error. A sketch of this procedure follows.
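A minimal sketch of that procedure, assuming a logistic-regression estimator on the iris data (both are illustrative, not part of the answer):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)

# Hold out a test set yourself; cross_val_score never sees it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Tune a hyperparameter with cross-validation on the training set only.
# Each fold's "test" part is acting as a validation set here.
best_C, best_score = None, -1.0
for C in [0.01, 0.1, 1, 10]:
    score = cross_val_score(LogisticRegression(C=C, max_iter=1000),
                            X_train, y_train, cv=5).mean()
    if score > best_score:
        best_C, best_score = C, score

# Refit on the full training set with the best configuration,
# then evaluate once on the untouched test set.
model = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))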
Upvotes: 0