gugatr0n1c

Reputation: 477

Scikit-learn, GroupKFold with shuffling groups?

I was using StratifiedKFold from scikit-learn, but now I also need to account for "groups". There is a nice class, GroupKFold, but my data are very time-dependent. As in the documentation example, the week number is the grouping index, and each week should appear in only one fold.

Suppose I need 10 folds. What I need is to shuffle the data first, before I can use GroupKFold.

The shuffling is in the group sense: whole groups should be shuffled among each other.

Is there an elegant way to do this with scikit-learn? It seems to me that GroupKFold is insensitive to shuffling the data first.

If there is no way to do it with scikit-learn, can anyone write some efficient code for this? I have large data sets.

The inputs are matrix, label, and groups.

Upvotes: 13

Views: 9250

Answers (3)

David R

Reputation: 1044

Here is a performant solution that essentially reassigns the values of the keys in a way that respects the original groups.

Code is shown below, but the 4 steps are:

  1. Shuffle the grouping-key vector. The goal is to rearrange where each grouping key first appears.
  2. Use np.unique() with return_index=True and return_inverse=True to get the first_index values (the position of the first occurrence of each unique key) and the inverse_index values that could be used to reconstruct the grouping-key vector.
  3. Use fancy indexing: index the first_index values with the inverse indexes to construct a new array of grouping keys, where each key is replaced by the position at which it first shows up in the shuffled grouping vector.
  4. Use this new vector of grouping keys in the standard GroupKFold splitter. Because the grouping indexes have been reordered, you can get a different set of splits than with the original keys.

To give a quick example, imagine your (already shuffled) grouping-key vector was [3, 1, 1, 5, 3, 5]. This procedure would then create the new grouping-key vector [0, 1, 1, 3, 0, 3]. The 3's have become 0's because 3 first shows up at position 0, the 1's stay 1's because 1 first shows up at position 1, and the 5's have become 3's because 5 first shows up at position 3. Each time you reshuffle the rows, the keys are relabelled differently, which can lead to a different set of splits by GroupKFold.
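To see the intermediate arrays, here is a small sketch that runs the np.unique() step on the vector [3, 1, 1, 5, 3, 5]. The variable names are only illustrative. Note that each new key is the position of the first occurrence of the original key, so the new labels are not necessarily consecutive.

```python
import numpy as np

keys = np.array([3, 1, 1, 5, 3, 5])

# unique_values is sorted; first_index[i] is where unique_values[i] first occurs
unique_values, first_index, inverse = np.unique(
    keys, return_index=True, return_inverse=True
)

# Replace every key by the position of its first occurrence
new_keys = first_index[inverse]

print(unique_values)  # [1 3 5]
print(first_index)    # [1 0 3]
print(inverse)        # [1 0 0 2 1 2]
print(new_keys)       # [0 1 1 3 0 3]
```

Rows that shared a key before still share a key afterwards, so the groups themselves are unchanged; only their labels differ.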

Code:

import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# Say that A is the official grouping key
A = list(range(10)) + list(range(10))
B = list(range(20))
y = np.zeros(20)

X = pd.DataFrame({
    'group': A,
    'var': B
})

# Shuffle the rows so that the first occurrence of each key moves around
X = X.sample(frac=1)

original_grouping_vector = X['group']
unique_values, indexes, inverse = np.unique(original_grouping_vector, return_inverse=True, return_index=True)
new_grouping_vector = indexes[inverse] # This is where the magic happens!

splitter = GroupKFold()
for train, test in splitter.split(X, y, groups=new_grouping_vector):
    print(X.iloc[test, :])

The above will print out different splits upon shuffling because the grouping-keys are being reordered, causing the value of new_grouping_vector to change.
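For comparison, here is a rough sketch of a different approach, not the relabelling trick above: run KFold(shuffle=True) over the unique group labels and map the chosen labels back to row indices. The helper name shuffled_group_kfold and the toy weeks array are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold


def shuffled_group_kfold(groups, n_splits=5, random_state=None):
    """Yield (train_idx, test_idx) pairs with whole groups shuffled across folds."""
    groups = np.asarray(groups)
    unique_groups = np.unique(groups)
    # Shuffle and split the group labels themselves, not the rows
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    for _, test_group_pos in kfold.split(unique_groups):
        test_mask = np.isin(groups, unique_groups[test_group_pos])
        yield np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


# Hypothetical data: 10 weeks with 3 rows each
weeks = np.repeat(np.arange(10), 3)
folds = list(shuffled_group_kfold(weeks, n_splits=5, random_state=0))

for train_idx, test_idx in folds:
    print(sorted(set(weeks[test_idx].tolist())))
```

One design caveat: this balances the number of groups per fold, not the number of rows, so folds can be uneven when group sizes differ a lot. GroupKFold itself balances by rows.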

Upvotes: 0

Mukul Gupta

Reputation: 21

With GroupKFold, the same group will not appear in two different folds. The number of distinct groups has to be at least equal to the number of folds.

The groups array passed to GroupKFold must have one entry per sample, i.e. the same length as the data.
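A quick way to check both points is a small sketch like the one below. The toy arrays are invented for illustration; it only uses GroupKFold.split and the ValueError raised when n_splits exceeds the number of groups.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array([0, 0, 1, 1, 2, 2])  # one label per row of X

# Collect which groups end up in the test part of each fold
gkf = GroupKFold(n_splits=3)
test_groups_per_fold = [
    set(groups[test_idx].tolist()) for _, test_idx in gkf.split(X, y, groups)
]
print(test_groups_per_fold)

# Asking for more folds than there are groups fails
try:
    list(GroupKFold(n_splits=4).split(X, y, groups))
except ValueError as err:
    too_many_splits_error = str(err)
    print(too_many_splits_error)
```

With three groups and three folds, each fold's test set holds exactly one group.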

For data in X, y and groups:

import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Toy data for illustration only; a real search needs far more rows per group
X = np.array([[1, 2, 1, 1], [3, 4, 7, 8], [5, 6, 1, 3], [7, 8, 4, 7]])
y = np.array([0, 2, 1, 2])
groups = np.array([2, 1, 0, 1])

# One fold per distinct group
group_kfold = GroupKFold(n_splits=len(np.unique(groups)))
group_kfold.get_n_splits(X, y, groups)

param_grid = {
    'min_child_weight': [50, 100],
    'subsample': [0.1, 0.2],
    'colsample_bytree': [0.1, 0.2],
    'max_depth': [2, 3],
    'learning_rate': [0.01],
    'n_estimators': [100, 500],
    'reg_lambda': [0.1, 0.2]
}

xgb = XGBClassifier()

# Pass the group-aware splits to the grid search
grid_search = GridSearchCV(xgb, param_grid, cv=group_kfold.split(X, y, groups), n_jobs=-1)

result = grid_search.fit(X, y)

Upvotes: 2

Melissa

Reputation: 765

EDIT: This solution does not work.

I think using sklearn.utils.shuffle is an elegant solution!

For data in X, y and groups:

from sklearn.utils import shuffle
X_shuffled, y_shuffled, groups_shuffled = shuffle(X, y, groups, random_state=0)

Then use X_shuffled, y_shuffled and groups_shuffled with GroupKFold:

from sklearn.model_selection import GroupKFold
group_k_fold = GroupKFold(n_splits=10)
splits = group_k_fold.split(X_shuffled, y_shuffled, groups_shuffled)

Of course, you probably want to shuffle multiple times and do the cross-validation with each shuffle. You could put the entire thing in a loop - here's a complete example with 5 shuffles (and only 3 splits instead of your required 10):

import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.utils import shuffle

X = np.arange(20).reshape((10, 2))
y = np.arange(10)
groups = [0, 0, 0, 1, 2, 3, 4, 5, 6, 7]

n_shuffles = 5
group_k_fold = GroupKFold(n_splits=3)

for i in range(n_shuffles):
    X_shuffled, y_shuffled, groups_shuffled = shuffle(X, y, groups, random_state=i)
    splits = group_k_fold.split(X_shuffled, y_shuffled, groups_shuffled)
    # do something with splits here, I'm just printing them out
    print('Shuffle', i)
    print('groups_shuffled:', groups_shuffled)
    for train_idx, val_idx in splits:
        print('Train:', train_idx)
        print('Val:', val_idx)
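As the EDIT note at the top of this answer says, this does not actually move groups between folds. A small sketch to check that: compare which groups share a test fold before and after shuffling the rows. The helper name fold_partition is made up for illustration, and the check assumes GroupKFold with its default settings.

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.utils import shuffle


def fold_partition(groups, n_splits=3):
    """Return the set of group sets that land together in each test fold."""
    groups = np.asarray(groups)
    X = np.zeros((len(groups), 1))  # features are irrelevant for the split
    gkf = GroupKFold(n_splits=n_splits)
    return {
        frozenset(groups[test_idx].tolist())
        for _, test_idx in gkf.split(X, groups=groups)
    }


groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])
baseline = fold_partition(groups)

# Shuffle the rows with several seeds and compare the group-to-fold partition
same = [fold_partition(shuffle(groups, random_state=seed)) == baseline
        for seed in range(5)]
print(same)
```

The row indices inside each fold change after shuffling, but the grouping of labels into folds stays the same, because GroupKFold assigns groups based on the group labels and their sizes, not on row order.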

Upvotes: 12
