Sait

Reputation: 19855

Leave one out cross validation using Sklearn

I am trying to use cross validation to test my classifier using Sklearn.

I have 3 classes, and total of 50 samples.

The following runs as expected, presumably performing 5-fold cross-validation:

result = cross_validation.cross_val_score(classifier, X, y, cv=5)

I am trying to do leave-one-out cross-validation, so with 50 samples I use cv=50 folds:

result = cross_validation.cross_val_score(classifier, X, y, cv=50)

However, surprisingly, it fails with the following warning and error:

/Library/Python/2.7/site-packages/sklearn/cross_validation.py:413: Warning: The least populated class in y has only 5 members, which is too few. The minimum number of labels for any class cannot be less than n_folds=50.
  % (min_labels, self.n_folds)), Warning)
/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/_methods.py:55: RuntimeWarning: Mean of empty slice.
  warnings.warn("Mean of empty slice.", RuntimeWarning)
/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/_methods.py:67: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
Traceback (most recent call last):
  File "b.py", line 96, in <module>
    scores1 = cross_validation.cross_val_score(classifier, X, y, cv=50)
  File "/Library/Python/2.7/site-packages/sklearn/cross_validation.py", line 1151, in cross_val_score
    for train, test in cv)
  File "/Library/Python/2.7/site-packages/sklearn/externals/joblib/parallel.py", line 653, in __call__
    self.dispatch(function, args, kwargs)
  File "/Library/Python/2.7/site-packages/sklearn/externals/joblib/parallel.py", line 400, in dispatch
    job = ImmediateApply(func, args, kwargs)
  File "/Library/Python/2.7/site-packages/sklearn/externals/joblib/parallel.py", line 138, in __init__
    self.results = func(*args, **kwargs)
  File "/Library/Python/2.7/site-packages/sklearn/cross_validation.py", line 1240, in _fit_and_score
    test_score = _score(estimator, X_test, y_test, scorer)
  File "/Library/Python/2.7/site-packages/sklearn/cross_validation.py", line 1296, in _score
    score = scorer(estimator, X_test, y_test)
  File "/Library/Python/2.7/site-packages/sklearn/metrics/scorer.py", line 176, in _passthrough_scorer
    return estimator.score(*args, **kwargs)
  File "/Library/Python/2.7/site-packages/sklearn/base.py", line 291, in score
    return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
  File "/Library/Python/2.7/site-packages/sklearn/neighbors/classification.py", line 147, in predict
    neigh_dist, neigh_ind = self.kneighbors(X)
  File "/Library/Python/2.7/site-packages/sklearn/neighbors/base.py", line 332, in kneighbors
    return_distance=return_distance)
  File "binary_tree.pxi", line 1307, in sklearn.neighbors.kd_tree.BinaryTree.query (sklearn/neighbors/kd_tree.c:10506)
  File "binary_tree.pxi", line 226, in sklearn.neighbors.kd_tree.get_memview_DTYPE_2D (sklearn/neighbors/kd_tree.c:2715)
  File "stringsource", line 247, in View.MemoryView.array_cwrapper (sklearn/neighbors/kd_tree.c:24789)
  File "stringsource", line 147, in View.MemoryView.array.__cinit__ (sklearn/neighbors/kd_tree.c:23664)
ValueError: Invalid shape in axis 0: 0.

Another odd thing: with cv=5 I get no warnings, but with cv=50 I get the warning above. I thought that as cv grows, the result should become more accurate, even if it is computationally more expensive. Is there a gap in my reasoning? Why do I get the warning and the error?

How can I do leave-one-out cross validation in this scenario properly?

Upvotes: 2

Views: 2792

Answers (1)

Andreas Mueller

Reputation: 28788

By default, cv=5 for classification does stratified 5-fold cross-validation. That means it tries to keep the fraction of samples from each class constant across folds. This likely causes trouble when the number of folds equals the number of samples, because each class then has fewer members than there are folds. Which version are you on? This error message is certainly not very helpful.
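To do leave-one-out properly, pass an explicit LeaveOneOut splitter instead of an integer (an integer is interpreted as stratified k-fold for classifiers). A minimal sketch, assuming the modern sklearn.model_selection API (the old cross_validation module was later removed) and synthetic data standing in for the 50-sample, 3-class setup:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic stand-in for the question's data: 50 samples, 3 classes
rng = np.random.RandomState(0)
X = rng.randn(50, 4)
y = rng.randint(0, 3, size=50)

classifier = KNeighborsClassifier(n_neighbors=3)

# LeaveOneOut yields one fold per sample, with no stratification,
# so the "least populated class" check never fires
scores = cross_val_score(classifier, X, y, cv=LeaveOneOut())
print(len(scores))        # one score per sample, i.e. 50
print(scores.mean())      # each fold's score is 0.0 or 1.0
```

Each held-out fold contains a single sample, so every individual score is either 0 or 1 and the mean is the leave-one-out accuracy.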

Btw, in general I'd suggest you use StratifiedShuffleSplit for such a small dataset.
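A sketch of that suggestion, again assuming the modern model_selection API and synthetic data in place of the question's dataset; the n_splits and test_size values here are illustrative choices, not prescribed ones:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

# Synthetic stand-in for the question's data: 50 samples, 3 classes
rng = np.random.RandomState(0)
X = rng.randn(50, 4)
y = rng.randint(0, 3, size=50)

classifier = KNeighborsClassifier(n_neighbors=3)

# Many small stratified resamples: each test set keeps the class
# proportions, which k-fold with k close to n_samples cannot do
cv = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=0)
scores = cross_val_score(classifier, X, y, cv=cv)
print(len(scores))        # 100 resampling iterations
print(scores.mean())      # average accuracy over the iterations
```

Because the splits are drawn repeatedly rather than partitioning the data once, the variance of the score estimate is lower on a small dataset than a single high-k cross-validation.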

[edit]: the current version gives a warning, which should probably be an error:

sklearn/cross_validation.py:399: Warning: The least populated class in y has only 13 members, which is too few. The minimum number of labels for any class cannot be less than n_folds=68.
  % (min_labels, self.n_folds)), Warning)

Upvotes: 5
