sos.cott

Reputation: 437

Shuffle split cross validation, what are the limitations?

The sklearn documentation for sklearn.cross_validation.ShuffleSplit states:

Note: contrary to other cross-validation strategies, random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.

Is this an issue? If so, why?

Upvotes: 4

Views: 1580

Answers (1)

Slybot

Reputation: 598

Unlike the more commonly used KFold cross-validation strategy, ShuffleSplit draws a fresh random sample of elements in each iteration. As a working example, consider a simple training dataset with 10 observations:

Training data = [1,2,3,4,5,6,7,8,9,10]

KFold (k=5)

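  1. Shuffle the data (optional; sklearn's KFold only shuffles when shuffle=True). Imagine it is now [6,9,1,4,10,5,7,2,3,8]
  2. Create folds: Fold 1 = [6,9], Fold 2 = [1,4], Fold 3 = [10,5], Fold 4 = [7,2] and Fold 5 = [3,8]
  3. Train k times, each time holding one fold aside for evaluation and training on all the others, so every observation is used for evaluation exactly once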
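The KFold steps above can be sketched in code. This is a minimal illustration, assuming the newer `sklearn.model_selection` module (the question refers to the older `sklearn.cross_validation`); the `random_state` value is arbitrary, and the actual fold contents will differ from the hand-written example.

```python
import numpy as np
from sklearn.model_selection import KFold

# The 10 observations from the example above.
X = np.arange(1, 11)

# shuffle=True mirrors step 1; without it, folds are taken in order.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
test_folds = [X[test_idx] for _, test_idx in kf.split(X)]

for i, fold in enumerate(test_folds, start=1):
    print(f"Fold {i}: test = {sorted(fold.tolist())}")

# The test folds are disjoint and together cover the whole dataset.
all_test = sorted(np.concatenate(test_folds).tolist())
print(all_test == list(range(1, 11)))
```

Because the folds partition the data, two KFold test sets can never coincide, which is the guarantee ShuffleSplit gives up.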

Shuffle split (n_iter=3, test_size=0.2)

It works in an iterative manner: you specify the number of iterations (default n_iter=10 in sklearn).

  1. In each iteration, shuffle the data: [6,9,1,4,10,3,8,2,5,7], [6,2,1,4,10,7,5,9,3,8] and [2,6,1,4,10,5,7,9,3,8]
  2. Split each shuffle into training and evaluation sets according to the test_size parameter. The training data are [6,9,1,4,10,3,8,2], [6,2,1,4,10,7,5,9] and [2,6,1,4,10,5,7,9] respectively. The test data are [5,7], [3,8] and [3,8] respectively.
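To see repeated splits actually occur, here is a rough sketch using `sklearn.model_selection.ShuffleSplit`, where the old `n_iter` argument is called `n_splits`. The choice of 50 iterations is an assumption made for illustration: with 10 observations and a test size of 2 there are only 45 possible test sets, so at least one must repeat.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# The 10 observations from the example above.
X = np.arange(1, 11)

# 50 independent random splits, each holding out 20% (2 observations).
ss = ShuffleSplit(n_splits=50, test_size=0.2, random_state=0)

# Store each test set as a frozenset so that order within a split is ignored.
test_sets = [frozenset(X[test_idx].tolist()) for _, test_idx in ss.split(X)]

n_distinct = len(set(test_sets))
print(f"{len(test_sets)} iterations, {n_distinct} distinct test sets")
```

Since the training set is just the complement of the test set, a repeated test set means the model is fitted and evaluated on exactly the same data again.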

As you can see, although the shuffles are different (technically they could even be identical), the training and test data for the last two iterations are exactly the same, because a split depends only on which elements land in each set, not on their order. As the number of iterations increases, your chance of fitting the same dataset twice increases, which runs counter to the idea of cross-validation, where we would like to estimate how well the model generalizes from a limited amount of data. On the other hand, real datasets usually contain so many observations that drawing the same (or a very similar) training and test split is very unlikely, so this is rarely an issue in practice. Keeping the number of iterations high enough gives a more stable estimate of how well your model generalizes.
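The "very likely for sizeable datasets" part can be made concrete with a back-of-the-envelope calculation. The helper below is hypothetical (it is not part of sklearn) and assumes every split is drawn independently and uniformly from all possible test sets, which turns the question into a birthday-problem computation.

```python
from math import comb

def prob_all_distinct(n_samples: int, n_test: int, n_iter: int) -> float:
    """Probability that n_iter random splits all have different test sets.

    Assumes each split is an independent, uniform draw from the
    comb(n_samples, n_test) possible test sets.
    """
    n_possible = comb(n_samples, n_test)
    p = 1.0
    for i in range(n_iter):
        # The (i+1)-th split must avoid the i test sets already drawn.
        p *= max(n_possible - i, 0) / n_possible
    return p

# Toy example from above: 10 observations, 2 held out, 3 iterations.
p_small = prob_all_distinct(10, 2, 3)
print(f"n=10:   {p_small:.4f}")

# A more realistic size: 1000 observations, 200 held out, 10 iterations.
p_large = prob_all_distinct(1000, 200, 10)
print(f"n=1000: {p_large:.4f}")
```

For the toy dataset there is roughly a 6.6% chance of at least one repeated split in just three iterations, while for 1000 observations the chance of a repeat is negligible.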

Upvotes: 4
