Reputation: 437
In the sklearn documentation for sklearn.cross_validation.ShuffleSplit, it states:
Note: contrary to other cross-validation strategies, random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.
Is this an issue? If so, why?
Upvotes: 4
Views: 1580
Reputation: 598
Unlike the most commonly used KFold cross-validation strategy, ShuffleSplit draws a random sample of elements in each iteration. As a working example, let's consider a simple training dataset with 10 observations:
Training data = [1,2,3,4,5,6,7,8,9,10]
KFold (k=5)
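With k=5 folds, KFold partitions the data deterministically: each observation lands in the test set exactly once. A minimal sketch (using the modern `sklearn.model_selection` API, since `sklearn.cross_validation` has been removed from recent sklearn versions):

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy dataset of 10 observations, as above
X = np.arange(1, 11)

# k=5: five consecutive, non-overlapping test folds of size 2
kf = KFold(n_splits=5)
for train_idx, test_idx in kf.split(X):
    print("train:", X[train_idx], "test:", X[test_idx])
```

Every element appears in a test fold exactly once, so all five folds are guaranteed to be different.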
Shuffle split (n_iter=3, test_size=0.2)
ShuffleSplit works in an iterative manner, where you specify the number of iterations (default n_iter=10 in sklearn).
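The equivalent sketch for ShuffleSplit (again using the modern `sklearn.model_selection` API, where the `n_iter` parameter is called `n_splits`; the `random_state` value is an arbitrary choice for reproducibility):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(1, 11)

# Three independent random splits; test_size=0.2 holds out 2 of the 10 elements
ss = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
for train_idx, test_idx in ss.split(X):
    print("train:", np.sort(X[train_idx]), "test:", np.sort(X[test_idx]))
```

Each iteration reshuffles independently, so unlike KFold there is nothing stopping two iterations from producing the same test set.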
As you can notice, although each shuffle is drawn independently, two iterations can produce exactly the same training and test data. As the number of iterations increases, the chance of fitting the same split again increases, which is counter-intuitive to the idea of cross-validation, where we would like to get an estimate of the generalizability of our model from a limited amount of data. On the other hand, datasets usually contain numerous observations, so having the same (or a very similar) training and test set is not an issue in practice. Keeping the number of iterations high enough improves the generalizability of your results.
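The repeated-split effect is easy to verify empirically on a tiny dataset. A sketch (same modern `sklearn.model_selection` API and an arbitrary `random_state` as assumptions) that counts distinct test sets over many iterations:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(1, 11)

# With test_size=0.2 on 10 elements there are only C(10, 2) = 45 possible
# test sets, so 100 random draws must repeat some of them (pigeonhole).
ss = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)
test_sets = [frozenset(test_idx) for _, test_idx in ss.split(X)]
print("distinct test sets:", len(set(test_sets)), "out of", len(test_sets))
```

On a sizeable dataset the number of possible splits is astronomically large, which is why the documentation calls repeated folds "very likely" to be avoided there.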
Upvotes: 4