Reputation: 4671
I see papers that use 10-fold cross-validation on data sets whose number of samples is not divisible by 10.
I couldn't find any case where they explained how they chose each subset.
My assumption is that they use some form of resampling, but if that were the case, a sample could appear in both subsets and therefore bias the model.
Paper as example: http://www.biomedcentral.com/1471-2105/9/319
Would it be recommended to do the following: split the data into 10 holdout folds of 8 samples each, so that 6 samples are never held out?
Doing it this way, every sample appears in at least one training set, but only 80 of the 86 samples are ever used as holdouts, and there is no bias from a sample appearing in both the training and holdout set of the same fold.
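In code, the scheme I have in mind would look roughly like this (a hypothetical sketch, not taken from the paper):

    import numpy as np

    rng = np.random.default_rng(0)
    indices = rng.permutation(86)                # shuffle all 86 sample indices

    holdout_folds = indices[:80].reshape(10, 8)  # 10 holdout sets of 8 samples each
    # the remaining 6 samples are never held out, so they appear in every training set

    for fold in holdout_folds:
        train = np.setdiff1d(indices, fold)      # the 78 samples not in this fold
        # fit on train (78 samples), evaluate on fold (8 samples)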
Any insight would be appreciated.
Upvotes: 1
Views: 1819
Reputation: 43477
You want the folds to have equal size, or as close to equal as possible.
To do this, if you have 86 samples and want to use 10-fold CV, then the first 86 % 10 = 6 folds will have size 86 // 10 + 1 = 9, and the remaining 4 folds will have size 86 // 10 = 8:

    6 * 9 = 54
    4 * 8 = 32
    ----------
         86
In general, if you have n samples and n_folds folds, you want to do what scikit-learn does: the first n % n_folds folds have size n // n_folds + 1, and the other folds have size n // n_folds.
Note: // stands for integer division.
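As a minimal sketch of that rule (fold_sizes is a hypothetical helper, not a scikit-learn function):

    def fold_sizes(n, n_folds):
        # the first n % n_folds folds get one extra sample
        base, extra = divmod(n, n_folds)
        return [base + 1] * extra + [base] * (n_folds - extra)

    print(fold_sizes(86, 10))       # [9, 9, 9, 9, 9, 9, 8, 8, 8, 8]
    print(sum(fold_sizes(86, 10)))  # 86 -- every sample lands in exactly one fold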
I'm not aware of a proper scientific reference for this, but it seems to be the convention. See this question and also this one for the same suggestions. At least two major machine learning libraries do it this way.
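For what it's worth, you can verify this directly with scikit-learn's KFold (assuming a version where it lives in sklearn.model_selection):

    import numpy as np
    from sklearn.model_selection import KFold

    X = np.arange(86).reshape(-1, 1)  # 86 dummy samples
    test_sizes = [len(test) for _, test in KFold(n_splits=10).split(X)]
    print(test_sizes)                 # [9, 9, 9, 9, 9, 9, 8, 8, 8, 8]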
Upvotes: 4