hexolitemax

Reputation: 45

Cross-validation using StratifiedKFold with an exogenous group feature

Good morning/afternoon, I would like to use cross-validation in sklearn for the prediction of a continuous variable.

I have referred to the "Visualizing cross-validation behavior in scikit-learn" page to select the cross-validation method best suited to my problem: https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html#sphx-glr-auto-examples-model-selection-plot-cv-indices-py

I want to use StratifiedKFold, but it stratifies on the target variable ("class" in the example figure on that page) and does not seem to provide a way to stratify on a different variable.

What I would like is to use the "group" variable to stratify instead.

Currently, what I do is this:

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

skf = StratifiedKFold(n_splits=5,
                      shuffle=True,
                      random_state=57)
cross_val_score(regr, X, y, cv=skf.split(training, groups))

where regr is my regressor, X my features, y my target, and groups a pandas Series holding my preferred "stratifying" variable. I have checked that skf.split(training, groups) provides splits suited to my needs, i.e., train and test sets in which the original distribution of my groups is maintained.
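A check of this kind could look like the sketch below. It is only an illustration on synthetic data: the arrays X and groups, the category labels, and the helper group_proportions are all made up here, and a NumPy array stands in for the pandas Series. It compares the share of each group in every test fold with the share in the full data set.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-ins (assumptions, not the data from the question)
rng = np.random.default_rng(0)
n_samples = 300
X = rng.normal(size=(n_samples, 4))
groups = rng.choice(["a", "b", "c"], size=n_samples, p=[0.5, 0.3, 0.2])

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=57)


def group_proportions(labels, categories):
    """Return the fraction of `labels` that falls in each category."""
    return np.array([np.mean(labels == c) for c in categories])


categories = np.unique(groups)
overall = group_proportions(groups, categories)

# Passing `groups` in the `y` position makes StratifiedKFold stratify on it
max_deviation = 0.0
for fold, (train_idx, test_idx) in enumerate(skf.split(X, groups)):
    test_props = group_proportions(groups[test_idx], categories)
    max_deviation = max(max_deviation, np.abs(test_props - overall).max())
    print(fold, np.round(test_props, 3))

print("overall", np.round(overall, 3))
print("largest deviation from overall proportions:", round(max_deviation, 4))
```

If the stratification works as intended, the per-fold proportions should stay close to the overall ones, differing only by rounding of the per-fold counts.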

However, I have no means of checking that the cross-validation itself behaves as I expect, i.e., that cross_val_score really uses these splits. Is my approach correct? Is there a way to check?
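One possible way to check, sketched here under assumptions: materialise the splits as a list, pass that list as cv (cross_val_score accepts an iterable of train/test index pairs), and compare the result with scores computed by hand on the same folds. The synthetic data and the choice of LinearRegression for regr are illustrative only and not taken from the question.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-ins (assumptions, not the data from the question)
rng = np.random.default_rng(0)
n_samples = 300
X = rng.normal(size=(n_samples, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=n_samples)
groups = rng.choice(["a", "b", "c"], size=n_samples, p=[0.5, 0.3, 0.2])

regr = LinearRegression()  # placeholder for the actual regressor
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=57)

# Materialise the splits so the same folds can be reused and inspected
splits = list(skf.split(X, groups))

# Scores from cross_val_score on the explicit splits (default scoring: regr.score)
cv_scores = cross_val_score(regr, X, y, cv=splits)

# Scores computed by hand on the very same train/test indices
manual_scores = np.array([
    clone(regr).fit(X[train_idx], y[train_idx]).score(X[test_idx], y[test_idx])
    for train_idx, test_idx in splits
])

scores_match = np.allclose(cv_scores, manual_scores)
print("cross_val_score:", np.round(cv_scores, 4))
print("manual loop:    ", np.round(manual_scores, 4))
print("identical:", scores_match)
```

If the two score arrays agree, cross_val_score evaluated the model on exactly the folds that were stratified on groups.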

Upvotes: 1

Views: 693

Answers (0)
