artemis

Reputation: 7251

StratifiedKFold failing to produce 9 distinct folds - Python 3.7

I am struggling to use sklearn's StratifiedKFold properly.

I have an extremely large dataset (X) and a corresponding array of class labels (y) that is imbalanced. I want to break it up into 9 stratified folds.

However, the results are not what I expect. I am essentially appending the entire dataset each time, creating 9 folds that each contain the entire dataset. The quirk is that I am not looking for a train and test split for each fold; I just want a stratified split of my data (i.e., split my data into 9 parts while maintaining the class imbalance).

# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html
skf = StratifiedKFold(n_splits=9, shuffle=False)

# Lists to hold the fold data and the fold classes
fold_data = []
fold_classes = []

print(X.shape)
print(y.shape)

unique, counts = np.unique(y, return_counts=True)
print(dict(zip(unique, counts)))

# Split into 9 splits
for train_index, test_index in skf.split(X, y):
    # Get the train and test data for the current split
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(X_train.shape, X_test.shape)  # Why is this shape so big?

    # Use numpy to concatenate the training and testing data
    temp_data = np.concatenate((X_train, X_test), axis=0)
    temp_classes = np.concatenate((y_train, y_test), axis=0)
    
    # Append the current fold to the overall folds
    fold_data.append(temp_data)
    fold_classes.append(temp_classes)

print("overall size: {}".format(X.shape))
for x in fold_data:
    print("Example fold size: {}".format(x.shape))

Yields:

(240970, 3291)
(240970,)
{0: 196365, 1: 44605}
(214195, 3291) (26775, 3291)
(214195, 3291) (26775, 3291)
(214195, 3291) (26775, 3291)
(214195, 3291) (26775, 3291)
(214195, 3291) (26775, 3291)
(214195, 3291) (26775, 3291)
(214195, 3291) (26775, 3291)
(214195, 3291) (26775, 3291)
(214195, 3291) (26775, 3291)

I have looked at various resources and cannot figure out how to accomplish what I am after. I am looking for something that creates 9 distinct folds with no overlapping data, each with approximately 26,774 rows and each maintaining the class split (about 21,818 of class 0 and 4,956 of class 1).

UPDATE: I tried using StratifiedShuffleSplit but got the same problem. Each fold is all of the data, not 1/9 of the data.

Upvotes: 0

Views: 272

Answers (1)

Harpal

Reputation: 12587

You could create the 9 splits by storing the test indices from each split and then using them to index your data:

from sklearn.model_selection import StratifiedKFold
import numpy as np

X = np.zeros((240970, 10)) # shape: (240970, 10)
y = np.random.randint(5, size=240970) # shape: (240970, )

skf = StratifiedKFold(n_splits=9, shuffle=False)

splits = []
for train_index, test_index in skf.split(X, y):
    # Only the test indices are needed; each sample appears in exactly one test set
    splits.append(test_index)

# Flatten the test indices into a single 1-D array
flat_idxs = np.concatenate(splits)

# Check that the number of unique indices equals the number of rows in X
assert np.unique(flat_idxs).shape[0] == X.shape[0]

The test indices do not overlap across splits, because StratifiedKFold places each sample in exactly one test set.
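To go from the stored indices to the actual folds the question asks for, you can index X and y with each test set instead of concatenating train and test back together. Here is a minimal sketch of that idea. The small random dataset, the roughly 81/19 class ratio, and the use of shuffle=True with random_state=8 are assumptions for illustration, not the asker's real data or settings; fold_data and fold_classes reuse the names from the question.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (assumption): imbalanced binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(9000, 5))
y = (rng.random(9000) < 0.185).astype(int)

# shuffle=True is optional; random_state only has an effect when shuffling
skf = StratifiedKFold(n_splits=9, shuffle=True, random_state=8)

# Keep only the held-out part of each split.
# Together, the nine held-out parts partition the dataset.
fold_data = []
fold_classes = []
for _, test_index in skf.split(X, y):
    fold_data.append(X[test_index])
    fold_classes.append(y[test_index])

# Each fold should hold about 1/9 of the rows with a similar class balance
for data, classes in zip(fold_data, fold_classes):
    print(data.shape, np.bincount(classes, minlength=2))
```

The shapes printed in the question were large because each iteration concatenated X_train (8/9 of the rows) with X_test (1/9 of the rows), which rebuilds the full dataset every time.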

Upvotes: 2
