Lin Ma

Reputation: 10159

scikit learn decision tree model evaluation

Here are the related code and documentation. For the default cross_val_score, without explicitly specifying scoring, what does the output array represent: precision, AUC, or some other metric?

Using Python 2.7 with the Miniconda interpreter.

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

>>> from sklearn.datasets import load_iris
>>> from sklearn.cross_validation import cross_val_score
>>> from sklearn.tree import DecisionTreeClassifier
>>> clf = DecisionTreeClassifier(random_state=0)
>>> iris = load_iris()
>>> cross_val_score(clf, iris.data, iris.target, cv=10)
array([ 1.     ,  0.93...,  0.86...,  0.93...,  0.93...,
        0.93...,  0.93...,  1.     ,  0.93...,  1.      ])

regards, Lin

Upvotes: 4

Views: 7731

Answers (2)

juanpa.arrivillaga

Reputation: 96277

From the user guide:

By default, the score computed at each CV iteration is the score method of the estimator. It is possible to change this by using the scoring parameter:
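For example, a minimal sketch (using the same sklearn.cross_validation import as the question; in newer scikit-learn versions this lives in sklearn.model_selection, and 'f1_macro' is just one of the valid scoring strings):

from sklearn.datasets import load_iris
from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer versions
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0)

# explicitly request a different metric via the scoring parameter;
# 'f1_macro' averages the per-class F1 scores (iris is multiclass)
f1_scores = cross_val_score(clf, iris.data, iris.target, cv=10, scoring='f1_macro')
print(f1_scores)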

From the DecisionTreeClassifier documentation:

Returns the mean accuracy on the given test data and labels. In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Don't be confused by "mean accuracy"; it's just the regular way accuracy is computed. Follow the links to the source:

    from .metrics import accuracy_score
    return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
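If you want to check this yourself, here's a quick sanity sketch (fitting and scoring on the same data only to compare the two calls, not as a proper evaluation):

from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# score() is just accuracy_score applied to the classifier's own predictions
assert clf.score(iris.data, iris.target) == accuracy_score(iris.target, clf.predict(iris.data))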

Now, the source for metrics.accuracy_score:

def accuracy_score(y_true, y_pred, normalize=True, sample_weight=None):
    ...
    # Compute accuracy for each possible representation
    y_type, y_true, y_pred = _check_targets(y_true, y_pred)
    if y_type.startswith('multilabel'):
        differing_labels = count_nonzero(y_true - y_pred, axis=1)
        score = differing_labels == 0
    else:
        score = y_true == y_pred

    return _weighted_sum(score, sample_weight, normalize)
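A tiny worked example with made-up labels shows what this boils down to in the plain (non-multilabel) case:

import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 1, 2, 2])
y_pred = np.array([0, 1, 1, 2])

# score is the boolean array y_true == y_pred, then _weighted_sum averages it:
# 3 correct out of 4 samples
print(accuracy_score(y_true, y_pred))                   # 0.75
print(accuracy_score(y_true, y_pred, normalize=False))  # 3, the raw count of correct predictions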

And if you still aren't convinced:

def _weighted_sum(sample_score, sample_weight, normalize=False):
    if normalize:
        return np.average(sample_score, weights=sample_weight)
    elif sample_weight is not None:
        return np.dot(sample_score, sample_weight)
    else:
        return sample_score.sum()

Note: for accuracy_score, the normalize parameter defaults to True, so it simply returns np.average of the boolean NumPy array, i.e. the fraction of correct predictions.
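In other words (a trivial sketch with a made-up per-sample score array):

import numpy as np

# one boolean per sample: "was this prediction correct?"
score = np.array([True, False, True, True])
print(np.average(score))  # 0.75 -- the fraction of correct predictions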

Upvotes: 3

Randy

Reputation: 14857

If a scoring argument isn't given, cross_val_score will default to using the .score method of the estimator you're using. For DecisionTreeClassifier, it's mean accuracy (as shown in the docstring below):

In [11]: DecisionTreeClassifier.score?
Signature: DecisionTreeClassifier.score(self, X, y, sample_weight=None)
Docstring:
Returns the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy
which is a harsh metric since you require for each sample that
each label set be correctly predicted.

Parameters
----------
X : array-like, shape = (n_samples, n_features)
    Test samples.

y : array-like, shape = (n_samples) or (n_samples, n_outputs)
    True labels for X.

sample_weight : array-like, shape = [n_samples], optional
    Sample weights.

Returns
-------
score : float
    Mean accuracy of self.predict(X) wrt. y.
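If you want to see the connection directly, here's a sketch that reproduces the cv=10 result by hand with StratifiedKFold (the splitter cross_val_score uses by default for classifiers), calling .score() on each held-out fold; it should print True under those default folds:

import numpy as np
from sklearn.cross_validation import StratifiedKFold, cross_val_score
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0)

# rebuild the same 10 stratified folds and score each one manually
manual = []
for train_idx, test_idx in StratifiedKFold(iris.target, n_folds=10):
    clf.fit(iris.data[train_idx], iris.target[train_idx])
    manual.append(clf.score(iris.data[test_idx], iris.target[test_idx]))

auto = cross_val_score(clf, iris.data, iris.target, cv=10)
print(np.allclose(auto, manual))  # True -- same per-fold mean accuracies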

Upvotes: 1
