Reputation: 3057
I am dividing my training set into stratified k-folds as follows:
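# Legacy API: sklearn.cross_validation (scikit-learn < 0.18)
from sklearn.cross_validation import StratifiedKFold

n_folds = 5
skf = list(StratifiedKFold(y, n_folds, random_state=SEED))
for k, (train, test) in enumerate(skf):
    # Index into the full arrays with the fold's train/validation indices
    X_train = X[train]
    y_train = y[train]
    X_val = X[test]
    y_val = y[test]
    clf.fit(X_train, y_train)
    preds = clf.predict_proba(X_val)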
The classification accuracy for the first 4 folds is as expected. The last fold has significantly worse accuracy.
I have tried varying the values of SEED and n_folds; in every case, the last fold is the worst (for 5 folds, by about 3%). Why is this happening?
Thank you.
Upvotes: 0
Views: 233
Reputation: 3057
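It turns out that StratifiedKFold does not shuffle the data by default, so the folds follow the original ordering of the samples. random_state is only used when shuffle=True, which is why changing SEED made no difference. I needed to set the shuffle parameter to True: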
n_folds = 10
skf = list(StratifiedKFold(y, n_folds, shuffle=True, random_state=SEED))
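For readers on a newer scikit-learn, where the splitter lives in sklearn.model_selection and takes n_splits instead of the labels, the same fix might look roughly like the sketch below. The synthetic X and y, the value of SEED, and the LogisticRegression classifier are assumptions made only so the snippet runs on its own; they are not the data or model from the question.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

SEED = 42  # assumed value; any integer works

# Synthetic stand-in data (assumption): 100 samples, two balanced classes,
# with the labels sorted by class to mimic ordered input data.
rng = np.random.RandomState(SEED)
X = rng.randn(100, 4)
y = np.repeat([0, 1], 50)

clf = LogisticRegression()  # placeholder classifier (assumption)

# shuffle=True shuffles the samples within each class before splitting;
# random_state only has an effect when shuffle is enabled.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)

fold_sizes = []
fold_scores = []
for k, (train, test) in enumerate(skf.split(X, y)):
    clf.fit(X[train], y[train])
    fold_sizes.append(len(test))
    # Per-fold accuracy on the held-out part
    fold_scores.append(clf.score(X[test], y[test]))
```

Comparing the entries of fold_scores with and without shuffle=True is a quick way to check whether the ordering of the data is what makes one fold stand out.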
Upvotes: 1