Reputation: 7251
I am struggling to use sklearn's StratifiedKFold correctly.
I have an extremely large dataset (X) and a corresponding array of class labels (y) that is imbalanced. I want to break the data up into 9 stratified folds.
However, the results are not what I expect. I am essentially appending the entire dataset each time, so I end up with 9 folds that each contain the entire dataset. The unusual part is that I do not want a train and test split for each fold. I just want a stratified split of my data (i.e., divide my data by 9 while maintaining the class imbalance).
import numpy as np
from sklearn.model_selection import StratifiedKFold

# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html
# random_state only applies when shuffle=True, so it is left out here
skf = StratifiedKFold(n_splits=9, shuffle=False)

# Lists to hold the fold data and the fold classes
fold_data = []
fold_classes = []

print(X.shape)
print(y.shape)
unique, counts = np.unique(y, return_counts=True)
print(dict(zip(unique, counts)))

# Split into 9 splits
for train_index, test_index in skf.split(X, y):
    # Train and test rows for the current split
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(X_train.shape, X_test.shape)  # Why this shape so big?

    # Use numpy to concatenate the training and testing data
    temp_data = np.concatenate((X_train, X_test), axis=0)
    temp_classes = np.concatenate((y_train, y_test), axis=0)

    # Append the current fold to the overall folds
    fold_data.append(temp_data)
    fold_classes.append(temp_classes)

print("overall size: {}".format(X.shape))
for x in fold_data:
    print("Example fold size: {}".format(x.shape))
Yields:
(240970, 3291)
(240970,)
{0: 196365, 1: 44605}
(214195, 3291) (26775, 3291)
(214195, 3291) (26775, 3291)
(214195, 3291) (26775, 3291)
(214195, 3291) (26775, 3291)
(214195, 3291) (26775, 3291)
(214195, 3291) (26775, 3291)
(214195, 3291) (26775, 3291)
(214195, 3291) (26775, 3291)
(214195, 3291) (26775, 3291)
I have looked at various resources and cannot figure out how to accomplish this. I am looking for 9 distinct folds with no overlapping data, each with approximately 26,774 rows and each keeping the class proportions (about 21,818 of class 0 and 4,956 of class 1).
UPDATE
I tried using StratifiedShuffleSplit but get the same problem. Each fold is all of the data, not 1/9 of the data.
Upvotes: 0
Views: 272
Reputation: 12587
You could create the 9 splits by storing the test indices from each split and then using those indices to select your 9 data splits:
from sklearn.model_selection import StratifiedKFold
import numpy as np
X = np.zeros(240970 * 10).reshape(240970, -1) # shape: (240970, 10)
y = np.random.randint(5, size=240970) # shape: (240970, )
skf = StratifiedKFold(n_splits=9, shuffle=False)
splits = []
for train_index, test_index in skf.split(X, y):
    splits.append(test_index)
# Flatten the test indices
flat_idxs = np.concatenate(splits).ravel()
# Check the number of unique indices equals the shape of X
np.unique(flat_idxs).shape[0] == X.shape[0] # True
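The test indices never overlap between splits: StratifiedKFold places every sample in exactly one test set, so the 9 test sets together cover the data once. The train set of each split is simply the other 8 test sets combined, which is why concatenating train and test rebuilds the whole dataset.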
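To go from indices to the actual folds, something along these lines should work. It is only a sketch: the small random X and y are stand-ins for the real data (the roughly 18.5% share of class 1 is an assumption chosen to mimic the imbalance in the question), and shuffle=True with random_state=8 is optional.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the real data (assumption: ~18.5% of rows are class 1)
rng = np.random.default_rng(0)
X = rng.normal(size=(9000, 5))
y = (rng.random(9000) < 0.185).astype(int)

# random_state is only valid together with shuffle=True
skf = StratifiedKFold(n_splits=9, shuffle=True, random_state=8)

# Keep only the test part of each split and ignore the train part
fold_data = []
fold_classes = []
for _, test_index in skf.split(X, y):
    fold_data.append(X[test_index])
    fold_classes.append(y[test_index])

print("overall size: {}".format(X.shape))
for fold_X, fold_y in zip(fold_data, fold_classes):
    # Fold shape followed by the per-class counts inside that fold
    print(fold_X.shape, np.bincount(fold_y))
```

Each entry of fold_data should then hold about 1/9 of the rows, with per-class counts that differ by at most one between folds.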
Upvotes: 2