Reputation: 53806
Reading the docs for k-fold cross-validation http://scikit-learn.org/stable/modules/cross_validation.html I'm attempting to understand the training procedure for each of the folds.
Is this correct: in generating the cross_val_score, each fold contains a new training and test set, and these training and test sets are used by the passed-in classifier clf in the code below to evaluate each fold's performance?
This would imply that increasing the size of each fold can affect accuracy depending on the size of the training set, as increasing the number of folds reduces the training data available for each fold?
From the docs, cross_val_score is generated using:
from sklearn import datasets, svm  # imports needed to make the doc snippet runnable
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
scores
array([ 0.96..., 1. ..., 0.96..., 0.96..., 1. ])
Upvotes: 1
Views: 6803
Reputation: 544
I don't think the statement "each fold contains a new training and test set" is correct.
By default, cross_val_score uses KFold cross-validation (for a classifier such as the SVC here it is actually the stratified variant, StratifiedKFold, but the mechanics are the same). This works by splitting the data set into K equal folds. Say we have 3 folds (fold1, fold2, fold3); then the algorithm works as follows:

run 1: train on fold1 + fold2, test on fold3
run 2: train on fold1 + fold3, test on fold2
run 3: train on fold2 + fold3, test on fold1
So each fold is used for both training and testing.
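If it helps to see this concretely, here is a minimal sketch (the toy array is my own, not from the question) that uses sklearn's KFold directly and prints which samples each run trains and tests on:

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(12).reshape(6, 2)  # toy data: 6 samples, 2 features

kf = KFold(n_splits=3)
for run, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    # each iteration is one "run": K-1 folds for training, 1 fold held out for testing
    print(f"run {run}: train on samples {train_idx}, test on samples {test_idx}")

# run 1: train on samples [2 3 4 5], test on samples [0 1]
# run 2: train on samples [0 1 4 5], test on samples [2 3]
# run 3: train on samples [0 1 2 3], test on samples [4 5]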
Now to the second part of your question. If you increase the number of rows of data in each fold (i.e. use fewer folds), you do reduce the number of training samples for each of the runs (above, that would be runs 1, 2, and 3), but the total number of training samples is unchanged.
Generally, selecting the right number of folds is both art and science. For some heuristics on how to choose your number of folds, I would suggest this answer. The bottom line is that accuracy can be slightly affected by your choice of the number of folds. For large data sets, you are relatively safe with a large number of folds; for smaller data sets, you should run the exercise multiple times with new random splits.
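To make the fold-size trade-off concrete, here is a minimal sketch (assuming the iris data from the question, 150 samples) that prints the per-run training and test set sizes for a few choices of K:

from sklearn import datasets
from sklearn.model_selection import KFold

iris = datasets.load_iris()  # 150 samples

for k in (3, 5, 10):
    kf = KFold(n_splits=k, shuffle=True, random_state=0)
    train_idx, test_idx = next(kf.split(iris.data))  # look at the first run only
    print(f"K={k}: {len(train_idx)} training samples, {len(test_idx)} test samples per run")

# K=3: 100 training samples, 50 test samples per run
# K=5: 120 training samples, 30 test samples per run
# K=10: 135 training samples, 15 test samples per run

More folds give each run more training data but a smaller, noisier test fold, which is exactly the trade-off the heuristics above are about.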
Upvotes: 7