Reputation: 45
Good morning/afternoon, I would like to use cross-validation in sklearn for the prediction of a continuous variable.
I have referred to the "Visualizing cross-validation behavior in scikit-learn" page to select a cross-validation method suited to my problem: https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html#sphx-glr-auto-examples-model-selection-plot-cv-indices-py
I want to use StratifiedKFold but it does not provide a way to use a "stratifying" variable that is not the target variable ("class") as in the example below.
What I would like is to use the "group" variable to stratify instead.
Currently, what I do is this:
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=57)
cross_val_score(regr, X, y, cv=skf.split(training,groups))
where regr is my regressor, X my features, y my target and groups a pandas Series holding my preferred "stratifying" variable. I have checked that skf.split(training,groups) provides splits suited to my needs, i.e., train and test sets in which the original distribution of my groups is maintained.
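A check of this kind can be sketched as follows. This is only a minimal illustration: the synthetic X and groups and the names splits, overall and max_dev are assumptions made for the example, not my real data. It materializes the splits as a list and compares the group proportions in each test fold with the overall proportions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-ins (assumptions for illustration only).
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
groups = pd.Series(rng.choice(["a", "b", "c"], size=n, p=[0.5, 0.3, 0.2]))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=57)

# Materialize the generator so the same splits can be inspected and reused.
splits = list(skf.split(X, groups))

# Overall share of each group, used as the reference distribution.
overall = groups.value_counts(normalize=True).sort_index()

max_dev = 0.0
for fold, (train_idx, test_idx) in enumerate(splits):
    # Share of each group inside this test fold, aligned with the reference.
    test_share = (
        groups.iloc[test_idx]
        .value_counts(normalize=True)
        .reindex(overall.index, fill_value=0.0)
    )
    max_dev = max(max_dev, float((test_share - overall).abs().max()))
    print(fold, test_share.round(3).to_dict())

print("largest deviation from overall group shares:", round(max_dev, 3))
```

Converting the generator to a list matters here: a generator is consumed once, so the splits that were inspected could not otherwise be passed on to cross_val_score.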
However, I have no means of checking that the cross-validation itself behaves as I expect, i.e., that cross_val_score really uses these splits. Is my approach correct? Is there a way to check?
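One possible way to check, sketched here under stated assumptions rather than as a definitive method: fit and score the model manually on each split and compare the result with what cross_val_score returns for the same list of splits. The synthetic data and the choice of LinearRegression as regr are placeholders for illustration. The comparison relies on cross_val_score falling back to the estimator's own score method (R² for regressors) when no scoring argument is given.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-ins (assumptions for illustration only).
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)
groups = rng.choice(["a", "b", "c"], size=n, p=[0.5, 0.3, 0.2])

regr = LinearRegression()  # placeholder regressor
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=57)

# Stratify on groups, not on y, and keep the splits as a reusable list.
splits = list(skf.split(X, groups))

# Scores reported by scikit-learn for exactly these splits.
cv_scores = cross_val_score(regr, X, y, cv=splits)

# The same evaluation done by hand, one split at a time.
manual_scores = []
for train_idx, test_idx in splits:
    model = clone(regr).fit(X[train_idx], y[train_idx])
    manual_scores.append(model.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

print("scores match:", np.allclose(cv_scores, manual_scores))
```

If the two arrays agree, cross_val_score has used the supplied train/test indices in the given order.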
Upvotes: 1
Views: 693