anotherplanet

Reputation:

How to persist the same folds when doing cross-validation across multiple models in scikit-learn?

I'm doing hyperparameter tuning across multiple models and comparing the results. The hyperparameters of each model are chosen by 5-fold cross-validation. I'm using sklearn.model_selection.KFold(n_splits=5, shuffle=True) to get a fold generator.

After checking the documentation on KFold and the source code of some models, I suspect a new set of folds is created for each model. I want to make things more fair and use the same (initially random) folds for all the models I'm tuning. Is there a way to do this in scikit-learn?
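Roughly what I'm doing now (the models, grids, and dataset below are just placeholders to illustrate the setup):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, KFold

    X, y = load_iris(return_X_y=True)

    # Each search builds its own KFold, so (I suspect) every model
    # ends up being tuned on a different random split of the data.
    models = {
        "logreg": (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
        "forest": (RandomForestClassifier(), {"n_estimators": [100, 300]}),
    }

    for name, (estimator, grid) in models.items():
        cv = KFold(n_splits=5, shuffle=True)  # new random folds every time
        search = GridSearchCV(estimator, grid, cv=cv)
        search.fit(X, y)
        print(name, search.best_score_)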

As a related question, does it make sense to use the same folds to obtain this fair comparison I'm trying to do?

Upvotes: 2

Views: 695

Answers (2)

Y.P

Reputation: 355

The goal of cross-validation is to obtain a representative estimate of how the model performs on held-out data. The more folds you use, the more accurate that estimate will be.

If you are using 5- or 10-fold cross-validation to compare different sets of hyperparameters, you don't have to use the exact same splits to compare your models. The average accuracy over all folds will give you a good idea of how each model is performing and will allow you to compare them.
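For instance (a minimal sketch with arbitrary models and the iris data), you can compare mean cross-validation scores even though each call shuffles independently:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)

    # Each call shuffles independently, but the mean over the 5 folds
    # is still a reasonable basis for comparing the two models.
    for model in (LogisticRegression(max_iter=1000), SVC()):
        scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True))
        print(type(model).__name__, scores.mean())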

Upvotes: 0

Ryan Volpi

Reputation: 111

You have two options:

  1. Shuffle your data at the beginning, then use KFold with shuffle=False.

  2. Keep shuffle=True and set the parameter random_state to the same integer each time you create the KFold.

Either option should result in the same folds every time you repeat KFold (a sketch of option 2 follows). See the documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
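For example, a minimal sketch of option 2 (the estimators, grids, and dataset are arbitrary): a fixed random_state makes every search run on identical splits:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, KFold
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)

    # With shuffle=True and a fixed random_state, KFold reproduces the
    # same shuffled folds, so both searches are evaluated on identical splits.
    cv = KFold(n_splits=5, shuffle=True, random_state=42)

    for name, (estimator, grid) in {
        "logreg": (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
        "svc": (SVC(), {"C": [0.1, 1, 10]}),
    }.items():
        search = GridSearchCV(estimator, grid, cv=cv)
        search.fit(X, y)
        print(name, search.best_score_)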

This approach makes logical sense to me, but I wouldn't expect it to make a significant difference. Perhaps someone else can give a more detailed explanation of the advantages / disadvantages.

Upvotes: 1
