Reputation: 3318
What is the difference between StratifiedKFold, StratifiedShuffleSplit, and StratifiedKFold + shuffle? When should I use each one? When do I get a better accuracy score, and why do I not get similar results? I have included my code and the results below. I am using Naive Bayes and 10x10 cross-validation.
#######SKF FOR LOOP########
from sklearn import cross_validation
from sklearn.cross_validation import StratifiedKFold

for i in range(10):
    skf = StratifiedKFold(y, n_folds=10, shuffle=True)
    scoresSKF2 = cross_validation.cross_val_score(clf, x, y, cv=skf)
    print(scoresSKF2)
    print("Accuracy SKF_NB: %0.2f (+/- %0.2f)" % (scoresSKF2.mean(), scoresSKF2.std() * 2))
    print("")
[ 0.1750503 0.16834532 0.16417051 0.18205424 0.1625758 0.1750939
0.15495808 0.1712963 0.17096494 0.16918166]
Accuracy SKF_NB: 0.17 (+/- 0.01)
[ 0.16297787 0.17956835 0.17309908 0.17686093 0.17239388 0.16093615
0.16970223 0.16956019 0.15473776 0.17208358]
Accuracy SKF_NB: 0.17 (+/- 0.01)
[ 0.17102616 0.16719424 0.1733871 0.16560877 0.166041 0.16122508
0.16767852 0.17042824 0.18719212 0.1677307 ]
Accuracy SKF_NB: 0.17 (+/- 0.01)
[ 0.17275079 0.16633094 0.16906682 0.17570687 0.17210511 0.15515747
0.16594391 0.18113426 0.16285135 0.1746953 ]
Accuracy SKF_NB: 0.17 (+/- 0.01)
[ 0.1764875 0.17035971 0.16186636 0.1644547 0.16632977 0.16469229
0.17635155 0.17158565 0.17849899 0.17005223]
Accuracy SKF_NB: 0.17 (+/- 0.01)
[ 0.16815177 0.16863309 0.17309908 0.17368725 0.17152758 0.16093615
0.17143683 0.17158565 0.16574906 0.16511898]
Accuracy SKF_NB: 0.17 (+/- 0.01)
[ 0.16786433 0.16690647 0.17309908 0.17022504 0.17066128 0.16613695
0.17259324 0.17737269 0.16256158 0.17643645]
Accuracy SKF_NB: 0.17 (+/- 0.01)
[ 0.16297787 0.16402878 0.17684332 0.16791691 0.16950621 0.1716267
0.18328997 0.16984954 0.15792524 0.17701683]
Accuracy SKF_NB: 0.17 (+/- 0.01)
[ 0.16958896 0.16633094 0.17165899 0.17080208 0.16026567 0.17538284
0.17490604 0.16840278 0.17502173 0.16511898]
Accuracy SKF_NB: 0.17 (+/- 0.01)
[ 0.17275079 0.15625899 0.17713134 0.16762839 0.18278949 0.16729269
0.16449841 0.17303241 0.16111272 0.1610563 ]
Accuracy SKF_NB: 0.17 (+/- 0.02)
#####StratifiedKFold + Shuffle######
from sklearn.utils import shuffle

for i in range(10):
    X, y = shuffle(x, y, random_state=i)
    skf = StratifiedKFold(y, 10)
    scoresSKF2 = cross_validation.cross_val_score(clf, X, y, cv=skf)
    print(scoresSKF2)
    print("Accuracy SKF_NB: %0.2f (+/- %0.2f)" % (scoresSKF2.mean(), scoresSKF2.std() * 2))
    print("")
[ 0.16700201 0.15913669 0.16359447 0.17772649 0.17297141 0.16931523
0.17172593 0.18576389 0.17125471 0.16134649]
Accuracy SKF_NB: 0.17 (+/- 0.02)
[ 0.02874389 0.02705036 0.02592166 0.02740912 0.02714409 0.02687085
0.02891009 0.02922454 0.0260794 0.02814858]
Accuracy SKF_NB: 0.03 (+/- 0.00)
[ 0.0221328 0.02848921 0.02361751 0.02942874 0.02598903 0.02947125
0.02804279 0.02719907 0.02376123 0.02205456]
Accuracy SKF_NB: 0.03 (+/- 0.01)
[ 0.02788158 0.02848921 0.03081797 0.03289094 0.02829916 0.03293846
0.02862099 0.02633102 0.03245436 0.02843877]
Accuracy SKF_NB: 0.03 (+/- 0.00)
[ 0.02874389 0.0247482 0.02448157 0.02625505 0.02483396 0.02860445
0.02948829 0.02604167 0.02665894 0.0275682 ]
Accuracy SKF_NB: 0.03 (+/- 0.00)
[ 0.0221328 0.02705036 0.02476959 0.02510098 0.02454519 0.02687085
0.02254987 0.02199074 0.02492031 0.02524666]
Accuracy SKF_NB: 0.02 (+/- 0.00)
[ 0.02615694 0.03079137 0.02102535 0.03029429 0.02252382 0.02889338
0.02197167 0.02604167 0.02752825 0.02843877]
Accuracy SKF_NB: 0.03 (+/- 0.01)
[ 0.02673182 0.02676259 0.03197005 0.03115984 0.02512273 0.03236059
0.02688638 0.02372685 0.03216459 0.02698781]
Accuracy SKF_NB: 0.03 (+/- 0.01)
[ 0.0258695 0.02964029 0.03081797 0.02740912 0.02916546 0.02976018
0.02717548 0.02922454 0.02694871 0.0275682 ]
Accuracy SKF_NB: 0.03 (+/- 0.00)
[ 0.03506755 0.0247482 0.02592166 0.02740912 0.02772163 0.02773765
0.02948829 0.0234375 0.03332367 0.02118398]
Accuracy SKF_NB: 0.03 (+/- 0.01)
######StratifiedShuffleSplit##########
from sklearn.cross_validation import StratifiedShuffleSplit

for i in range(10):
    sss = StratifiedShuffleSplit(y, 10, test_size=0.1, random_state=0)
    scoresSSS = cross_validation.cross_val_score(clf, x, y, cv=sss)
    print(scoresSSS)
    print("Accuracy SKF_NB: %0.2f (+/- %0.2f)" % (scoresSSS.mean(), scoresSSS.std() * 2))
    print("")
[ 0.02743286 0.02858793 0.02512273 0.02281259 0.02541149 0.02743286
0.02570026 0.02454519 0.02570026 0.02858793]
Accuracy SKF_NB: 0.03 (+/- 0.00)
[ 0.02743286 0.02858793 0.02512273 0.02281259 0.02541149 0.02743286
0.02570026 0.02454519 0.02570026 0.02858793]
Accuracy SKF_NB: 0.03 (+/- 0.00)
[ 0.02743286 0.02858793 0.02512273 0.02281259 0.02541149 0.02743286
0.02570026 0.02454519 0.02570026 0.02858793]
Accuracy SKF_NB: 0.03 (+/- 0.00)
[ 0.02743286 0.02858793 0.02512273 0.02281259 0.02541149 0.02743286
0.02570026 0.02454519 0.02570026 0.02858793]
Accuracy SKF_NB: 0.03 (+/- 0.00)
[ 0.02743286 0.02858793 0.02512273 0.02281259 0.02541149 0.02743286
0.02570026 0.02454519 0.02570026 0.02858793]
Accuracy SKF_NB: 0.03 (+/- 0.00)
[ 0.02743286 0.02858793 0.02512273 0.02281259 0.02541149 0.02743286
0.02570026 0.02454519 0.02570026 0.02858793]
Accuracy SKF_NB: 0.03 (+/- 0.00)
[ 0.02743286 0.02858793 0.02512273 0.02281259 0.02541149 0.02743286
0.02570026 0.02454519 0.02570026 0.02858793]
Accuracy SKF_NB: 0.03 (+/- 0.00)
[ 0.02743286 0.02858793 0.02512273 0.02281259 0.02541149 0.02743286
0.02570026 0.02454519 0.02570026 0.02858793]
Accuracy SKF_NB: 0.03 (+/- 0.00)
[ 0.02743286 0.02858793 0.02512273 0.02281259 0.02541149 0.02743286
0.02570026 0.02454519 0.02570026 0.02858793]
Accuracy SKF_NB: 0.03 (+/- 0.00)
[ 0.02743286 0.02858793 0.02512273 0.02281259 0.02541149 0.02743286
0.02570026 0.02454519 0.02570026 0.02858793]
Accuracy SKF_NB: 0.03 (+/- 0.00)
Upvotes: 11
Views: 6736
Reputation: 866
It's hard to say which one is better. The choice depends more on your modeling strategy and goals; that said, there is a strong preference in the community for K-fold cross-validation for both model selection and performance estimation. I will try to give you some intuition for the two main concepts that guide the choice of sampling technique: stratification, and cross-validation vs. random splitting.
Also keep in mind that you can use these sampling techniques for two very different goals: model selection and performance estimation.
Stratification works by preserving the balance, or ratio, between the labels/targets of the dataset. So if your whole dataset has two labels (e.g. positive and negative) in a 30/70 ratio, and you split it into 10 subsamples, each stratified subsample should keep that same ratio. The reasoning: because model performance is in general very sensitive to class balance, this strategy often makes results more stable across subsamples.
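As a minimal sketch of this (using the newer sklearn.model_selection API rather than the deprecated sklearn.cross_validation one from the question, and a made-up dataset with a 30/70 label ratio), you can verify that every stratified test fold keeps the same ratio:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical dataset: 30 positives and 70 negatives (a 30/70 ratio)
y = np.array([1] * 30 + [0] * 70)
X = np.arange(100).reshape(-1, 1)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # Each 10-sample test fold contains exactly 3 positives and 7 negatives
    print(y[test_idx].sum(), "positives out of", len(test_idx))
```

A plain (non-stratified) KFold on this sorted y would instead put all positives into the first few folds.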
Split vs. random split. A split is just a partition of the data, usually for the purpose of having separate training and testing subsamples. But taking the first X% of rows as one subsample and the remainder as the other can introduce a strong bias, for example when the data are sorted by label or collected over time. That is where random splitting comes into play: it introduces randomness into the subsampling.
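For example (a sketch with made-up, label-sorted data, again using the modern API), a plain split of sorted data yields a test set containing only one class, while a random split does not suffer from that ordering bias:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(-1, 1)
y = np.array([0] * 5 + [1] * 5)  # data sorted by label

# Plain split: the test set is just the last 20% of rows,
# so here it contains only label 1.
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.2, shuffle=False)
print(set(y_test_plain))  # {1}

# Random split: rows are shuffled first, removing the ordering bias.
_, _, _, y_test_random = train_test_split(X, y, test_size=0.2, random_state=0)
print(set(y_test_random))
```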
K-fold cross-validation vs. random split. K-fold consists of creating K subsamples. Because you now have more than two subsamples, you can hold one out for testing and train on the remaining K-1, repeat this for every possible choice of test fold, and average the results. This is known as cross-validation. Doing K-fold cross-validation is like doing a (non-random) split K times and then averaging. A small sample might not benefit much from k-fold cross-validation, while a large sample almost always does. A single random split is a more efficient (faster) way of estimating performance, but it is more prone to sampling bias than k-fold cross-validation. Combining stratification with random splitting or cross-validation is an attempt to have an effective and efficient sampling strategy that preserves the label distribution.
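Putting both ideas together, here is how the two strategies from the question look with the modern sklearn.model_selection API; GaussianNB and the iris dataset are stand-ins for your clf, x, and y:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import (StratifiedKFold, StratifiedShuffleSplit,
                                     cross_val_score)
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
clf = GaussianNB()

# Stratified k-fold: 10 folds, every sample is tested exactly once.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores_skf = cross_val_score(clf, X, y, cv=skf)

# Stratified shuffle split: 10 independent random train/test draws;
# a sample may appear in several test sets, or in none.
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=0)
scores_sss = cross_val_score(clf, X, y, cv=sss)

print("SKF: %0.2f (+/- %0.2f)" % (scores_skf.mean(), scores_skf.std() * 2))
print("SSS: %0.2f (+/- %0.2f)" % (scores_sss.mean(), scores_sss.std() * 2))
```

Note that in the modern API the splitter takes only the fold parameters; the labels are passed later via `split(X, y)` or, as here, implicitly by `cross_val_score`.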
Upvotes: 11