Reputation: 10159
Here are the related code and documentation. For the default cross_val_score, without explicitly specifying a scoring parameter, what does the output array mean: precision, AUC, or some other metric?
I'm using Python 2.7 with a Miniconda interpreter.
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
>>> from sklearn.datasets import load_iris
>>> from sklearn.cross_validation import cross_val_score
>>> from sklearn.tree import DecisionTreeClassifier
>>> clf = DecisionTreeClassifier(random_state=0)
>>> iris = load_iris()
>>> cross_val_score(clf, iris.data, iris.target, cv=10)
array([ 1.   , 0.93..., 0.86..., 0.93..., 0.93...,
        0.93..., 0.93..., 1.   , 0.93..., 1.   ])
regards, Lin
Upvotes: 4
Views: 7731
Reputation: 96277
From the user guide:
By default, the score computed at each CV iteration is the score method of the estimator. It is possible to change this by using the scoring parameter:
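For example, a minimal sketch of overriding the default with an explicit scorer, reusing the iris / DecisionTreeClassifier setup from the question ('f1_macro' is just one of scikit-learn's predefined scorer names, chosen here for illustration):

from sklearn.datasets import load_iris
from sklearn.cross_validation import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0)

# default: falls back to the estimator's .score method (accuracy for classifiers)
default_scores = cross_val_score(clf, iris.data, iris.target, cv=10)

# explicit scorer via the scoring parameter, e.g. macro-averaged F1
f1_scores = cross_val_score(clf, iris.data, iris.target, cv=10,
                            scoring='f1_macro')

print(default_scores.mean())
print(f1_scores.mean())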
From the DecisionTreeClassifier documentation:
Returns the mean accuracy on the given test data and labels. In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.
Don't be confused by "mean accuracy"; it's just the regular way of computing accuracy. Follow the links to the source:
# excerpt from the estimator's score method (inherited from ClassifierMixin):
from .metrics import accuracy_score
return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
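In other words, for a fitted classifier the two are interchangeable. A quick sketch, fitting and scoring on the full iris data purely for illustration:

from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# .score is just accuracy_score on the classifier's own predictions
print(clf.score(iris.data, iris.target))
print(accuracy_score(iris.target, clf.predict(iris.data)))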
Now, the source for metrics.accuracy_score:
def accuracy_score(y_true, y_pred, normalize=True, sample_weight=None):
    ...
    # Compute accuracy for each possible representation
    y_type, y_true, y_pred = _check_targets(y_true, y_pred)
    if y_type.startswith('multilabel'):
        differing_labels = count_nonzero(y_true - y_pred, axis=1)
        score = differing_labels == 0
    else:
        score = y_true == y_pred

    return _weighted_sum(score, sample_weight, normalize)
And if you still aren't convinced:
def _weighted_sum(sample_score, sample_weight, normalize=False):
    if normalize:
        return np.average(sample_score, weights=sample_weight)
    elif sample_weight is not None:
        return np.dot(sample_score, sample_weight)
    else:
        return sample_score.sum()
Note: for accuracy_score, the normalize parameter defaults to True, so it simply returns np.average of the boolean numpy array, i.e. the fraction of correct predictions.
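As a tiny made-up illustration of that last step, averaging a boolean array of "was this prediction correct?" flags gives exactly that fraction:

import numpy as np

y_true = np.array([0, 1, 2, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0])

correct = y_true == y_pred   # array([ True,  True, False,  True,  True])
print(np.average(correct))   # 0.8, i.e. 4 out of 5 predictions correct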
Upvotes: 3
Reputation: 14857
If a scoring argument isn't given, cross_val_score will default to using the .score method of the estimator you're using. For DecisionTreeClassifier, that is the mean accuracy (as shown in the docstring below):
In [11]: DecisionTreeClassifier.score?
Signature: DecisionTreeClassifier.score(self, X, y, sample_weight=None)
Docstring:
Returns the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy
which is a harsh metric since you require for each sample that
each label set be correctly predicted.
Parameters
----------
X : array-like, shape = (n_samples, n_features)
    Test samples.

y : array-like, shape = (n_samples) or (n_samples, n_outputs)
    True labels for X.

sample_weight : array-like, shape = [n_samples], optional
    Sample weights.

Returns
-------
score : float
    Mean accuracy of self.predict(X) wrt. y.
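If you want to double-check that the default really is fold-wise accuracy, one possible sketch is to rebuild the folds by hand (this reuses the deprecated sklearn.cross_validation module from the question; in newer versions the equivalent classes live in sklearn.model_selection):

import numpy as np
from sklearn.cross_validation import StratifiedKFold, cross_val_score
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0)

# cross_val_score builds stratified folds for classifiers when cv is an int,
# so construct the same kind of folds explicitly
cv = StratifiedKFold(iris.target, n_folds=10)

manual = []
for train_idx, test_idx in cv:
    clf.fit(iris.data[train_idx], iris.target[train_idx])
    pred = clf.predict(iris.data[test_idx])
    manual.append(accuracy_score(iris.target[test_idx], pred))

print(np.allclose(manual, cross_val_score(clf, iris.data, iris.target, cv=cv)))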
Upvotes: 1