Reputation: 1758
I'm using sklearn to train a decision tree classifier.
But something weird is happening.
The accuracy returned by the decision tree's score function (0.88) is much higher than the one from cross_val_score
(around 0.84).
According to the documentation, the score function also computes the mean accuracy.
Both of them are applied to the test dataset (87992 samples).
Since cross-validation computes its scores on subsets, a slightly different result would make sense, but here the difference is quite large.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report

clf_tree = DecisionTreeClassifier()
clf_tree.fit(X_train, y_train)
print('Accuracy: %f' % clf_tree.score(X_test, y_test))
print(cross_val_score(clf_tree, X_test, y_test, cv=10, scoring='accuracy'))
print(classification_report(clf_tree.predict(X_test), y_test))
Output:
Accuracy: 0.881262
[0.84022727 0.83875 0.843164 0.84020911 0.84714172 0.83929992 0.83873167 0.8422548 0.84089101 0.84111831]
              precision    recall  f1-score   support

           0       0.89      0.88      0.88     44426
           1       0.88      0.89      0.88     43566

   micro avg       0.88      0.88      0.88     87992
   macro avg       0.88      0.88      0.88     87992
weighted avg       0.88      0.88      0.88     87992
What's really going on here? Thanks for any advice.
Upvotes: 0
Views: 316
Reputation: 3082
You have a misunderstanding of what cross_val_score does.
Assuming you have a dataset with 100 rows and split it into train (70%) and test (30%), you will train with 70 rows and test with 30 in the following part of your code:
clf_tree = DecisionTreeClassifier()
clf_tree.fit(X_train, y_train)
print('Accuracy: %f' % clf_tree.score(X_test, y_test))
Later, on the other hand, you call
print((cross_val_score(clf_tree, X_test, y_test, cv=10, scoring='accuracy')))
Here cross_val_score takes your 30 rows of test data and splits them into 10 parts. Then it uses 9 parts for training and 1 part to test a completely new, freshly trained classifier. That is repeated until each block has been tested once (10 times).
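Roughly, this is what happens under the hood. A minimal sketch (assuming X_test and y_test are NumPy arrays; for a classifier with an integer cv, cross_val_score stratifies the folds):
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold

scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=10).split(X_test, y_test):
    fold_clf = clone(clf_tree)  # a fresh, unfitted copy; the earlier fit is discarded
    fold_clf.fit(X_test[train_idx], y_test[train_idx])  # train on 9 of the 10 folds
    scores.append(fold_clf.score(X_test[test_idx], y_test[test_idx]))  # score the held-out fold
print(scores)
Note the clone: the classifier you fitted on X_train is never used here; each fold gets a brand-new tree.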
So at the end, your first classifier was trained with 70% of your data, while the 10 classifiers of your cross_val_score run were each trained with only 27% of your data (9/10 of the 30% test set).
And in machine learning we often see that more training data gives better results.
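If you want to see this effect directly, sklearn's learning_curve evaluates the same kind of model at several training-set sizes. A rough sketch, reusing the X_train and y_train from the question:
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# evaluate a fresh tree at 5 training-set sizes, with 5-fold CV each time
sizes, _, val_scores = learning_curve(
    DecisionTreeClassifier(), X_train, y_train,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring='accuracy')
for n, score in zip(sizes, val_scores.mean(axis=1)):
    print('%6d training samples -> mean CV accuracy %.4f' % (n, score))
Typically the curve rises with the number of training samples, which is exactly the gap you are seeing.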
To make the point clear: in your code, the following two lines do exactly the same thing:
print((cross_val_score(clf_tree, X_test, y_test, cv=10, scoring='accuracy')))
print((cross_val_score(DecisionTreeClassifier(), X_test, y_test, cv=10, scoring='accuracy')))
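As a side note: if you want a cross-validated estimate that is comparable to your first number, one option (not in your original code) is to cross-validate on the larger training set instead of the test set:
# cross-validate on the training data, so each fold trains on
# roughly as much data as your original classifier did
print(cross_val_score(DecisionTreeClassifier(), X_train, y_train, cv=10, scoring='accuracy').mean())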
Upvotes: 3