Last Stratified K-Fold Performance Distinct

Question

I am dividing my training set into stratified k-folds as follows:

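from sklearn.cross_validation import StratifiedKFold  # legacy module (scikit-learn < 0.18)

n_folds = 5
skf = list(StratifiedKFold(y, n_folds, random_state=SEED))

for k, (train, test) in enumerate(skf):
    # Select this fold's training and validation rows by index
    X_train = X[train]
    y_train = y[train]
    X_val = X[test]
    y_val = y[test]

    clf.fit(X_train, y_train)
    preds = clf.predict_proba(X_val)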

The classification accuracy for the first 4 folds is as expected. The last fold has significantly worse accuracy.

I have tried varying the values of SEED and n_folds. In all cases, the last fold is the worst (for 5 folds, by about 3%). Why is this happening?

Thank you.

Chris Parry · Accepted Answer

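It turns out that StratifiedKFold does not shuffle the data by default, so the folds follow the original row order and random_state has no effect. If the rows are ordered in some way, the last fold can end up with samples that differ systematically from the rest. The fix was to set the shuffle parameter to True: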

n_folds = 10
skf = list(StratifiedKFold(y, n_folds, shuffle=True, random_state=SEED))
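
For readers on a current scikit-learn release, where StratifiedKFold lives in sklearn.model_selection and takes n_splits instead of y and n_folds, the same fix might look roughly like the sketch below. The toy arrays, the SEED value, and the last_fold_indices helper are assumptions made for illustration and are not part of the original code.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

SEED = 0  # hypothetical seed, chosen only for this sketch

# Toy data (assumed): 20 rows in a fixed order, two balanced classes
X = np.arange(20).reshape(-1, 1)
y = np.array([0, 1] * 10)


def last_fold_indices(shuffle):
    """Return the validation indices of the final fold."""
    # random_state is only meaningful when shuffle=True, so it is
    # left as None for the unshuffled splitter.
    skf = StratifiedKFold(
        n_splits=5,
        shuffle=shuffle,
        random_state=SEED if shuffle else None,
    )
    folds = [test for _, test in skf.split(X, y)]
    return folds[-1]


# Without shuffling, the last fold is drawn from the tail of the data
unshuffled_last = last_fold_indices(shuffle=False)

# With shuffling, rows are permuted within each class before folds are cut
shuffled_last = last_fold_indices(shuffle=True)

print("last fold, shuffle=False:", unshuffled_last)
print("last fold, shuffle=True: ", shuffled_last)
```

In both cases each fold keeps the class balance. Only the shuffled splitter breaks the link between a row's position in the data and the fold it lands in.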
