sw007sw

Reputation: 161

Sklearn: Cross validation for grouped data

I am trying to implement a cross-validation scheme on grouped data. I was hoping to use GroupKFold, but I keep getting an error. What am I doing wrong? The code below is slightly different from the one I actually ran (I had different data, so I used a larger n_splits), but everything else is the same:

from sklearn import metrics
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.grid_search import GridSearchCV
from xgboost import XGBRegressor
#generate data
x=np.array([0,1,2,3,4,5,6,7,8,9,10,11,12,13])
y= np.array([1,2,3,4,5,6,7,1,2,3,4,5,6,7])
group=np.array([1,0,1,1,2,2,2,1,1,1,2,0,0,2])
#grid search
gkf = GroupKFold( n_splits=3).split(x,y,group)
subsample = np.arange(0.3,0.5,0.1)
param_grid = dict( subsample=subsample)
rgr_xgb = XGBRegressor(n_estimators=50)
grid_search = GridSearchCV(rgr_xgb, param_grid, cv=gkf, n_jobs=-1)
result = grid_search.fit(x, y)

The error:

Traceback (most recent call last):

File "<ipython-input-143-11d785056a08>", line 8, in <module>
result = grid_search.fit(x, y)

File "/home/student/anaconda/lib/python3.5/site-packages/sklearn/grid_search.py", line 813, in fit
return self._fit(X, y, ParameterGrid(self.param_grid))

 File "/home/student/anaconda/lib/python3.5/site-packages/sklearn/grid_search.py", line 566, in _fit
n_folds = len(cv)

TypeError: object of type 'generator' has no len()

Changing the line

gkf = GroupKFold( n_splits=3).split(x,y,group)

to

gkf = GroupKFold( n_splits=3)

does not work either. The error message is then:

'GroupKFold' object is not iterable

Upvotes: 15

Views: 8533

Answers (2)

Aleksejs Fomins

Reputation: 900

Here is an optimization of Moses's answer. Storing all splits at once may be memory-limiting, so we can wrap the original yield mechanism to return only one train/test split at a time:

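import numpy as np
from sklearn.model_selection import BaseCrossValidator

class KFoldHelper:
    def __init__(self, kfold: BaseCrossValidator, x: np.ndarray,
                 classes: np.ndarray = None, groups: np.ndarray = None):
        # Keep the inputs instead of a live generator, so the helper
        # can be iterated more than once without being exhausted.
        self.kfold, self.x, self.classes, self.groups = kfold, x, classes, groups

    def __iter__(self):
        # Lazily yield one (train indices, test indices) pair at a time.
        yield from self.kfold.split(self.x, y=self.classes, groups=self.groups)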

Now we can call

kfold = KFoldHelper(GroupKFold(n_splits=3), x, classes=y, groups=group)

and

GridSearchCV(rgr_xgb, param_grid, cv=kfold, n_jobs=-1)
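If the goal is only to avoid building the list of splits yourself, another option may be worth trying: the GridSearchCV in sklearn.model_selection (not the older sklearn.grid_search one from the question) accepts the splitter object directly and takes the group labels through the groups argument of fit. A minimal sketch, assuming a DecisionTreeRegressor and a max_depth grid as stand-ins for XGBRegressor and subsample, and with x reshaped to 2-D:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.tree import DecisionTreeRegressor

# Data from the question; x is reshaped to (n_samples, n_features).
x = np.arange(14).reshape(-1, 1)
y = np.array([1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7])
group = np.array([1, 0, 1, 1, 2, 2, 2, 1, 1, 1, 2, 0, 0, 2])

# Stand-in estimator and grid (assumptions, not the question's XGBRegressor setup).
param_grid = {"max_depth": [1, 2]}
grid_search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    cv=GroupKFold(n_splits=3),  # pass the splitter itself, not .split(...)
)

# The group labels are forwarded to GroupKFold.split via fit().
grid_search.fit(x, y, groups=group)
n_splits_used = grid_search.n_splits_
```

Here the search calls split itself, so no wrapper class is needed.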

Upvotes: 0

Moses Koledoye

Reputation: 78556

The split method of GroupKFold is a generator: it yields the (training, test) index pairs one at a time, and a generator has no length. Call list on the result of split to collect all the pairs, so that GridSearchCV can compute len(cv):

gkf = list(GroupKFold( n_splits=3).split(x,y,group))
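For anyone who wants to check this end to end, here is a small sketch using the data from the question. It assumes a DecisionTreeRegressor with a max_depth grid in place of XGBRegressor and subsample (so it runs without xgboost), imports GridSearchCV from sklearn.model_selection, and reshapes x to 2-D since scikit-learn estimators expect (n_samples, n_features):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.tree import DecisionTreeRegressor

# Data from the question; x is reshaped to (n_samples, n_features).
x = np.arange(14).reshape(-1, 1)
y = np.array([1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7])
group = np.array([1, 0, 1, 1, 2, 2, 2, 1, 1, 1, 2, 0, 0, 2])

# Materialize the generator so the folds have a length and can be reused.
gkf = list(GroupKFold(n_splits=3).split(x, y, group))
n_folds = len(gkf)

# Sanity check: no group appears in both the train and test side of a fold.
groups_are_disjoint = all(
    set(group[train_idx]).isdisjoint(group[test_idx])
    for train_idx, test_idx in gkf
)

# Stand-in estimator and grid (assumptions, not the question's XGBRegressor setup).
param_grid = {"max_depth": [1, 2]}
grid_search = GridSearchCV(DecisionTreeRegressor(random_state=0), param_grid, cv=gkf)
grid_search.fit(x, y)
```

Because gkf is a plain list, it can also be inspected or reused across several searches.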

Upvotes: 29
