Reputation: 61
I'm trying to calculate a confidence interval for my classification model, a DecisionTreeClassifier in scikit-learn.
Reading the scikit-learn documentation about cross-validation and confidence intervals (https://scikit-learn.org/dev/modules/cross_validation.html), I found the code below and it seemed pretty straightforward; however, I don't understand why the upper limit is greater than 1. How can the accuracy be higher than 100%?
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)  # one accuracy score per fold
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.98 (+/- 0.03)
Upvotes: 6
Views: 3802
Reputation: 1004
As mentioned here, it may be a better idea to just clip the confidence interval to the valid range [0, 1].
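For example, a minimal sketch of that clipping, reusing the iris/SVC setup from the question (the variable names are just illustrative):

import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)

# 95% interval under the normal approximation: mean +/- 2*std
lower = scores.mean() - 2 * scores.std()
upper = scores.mean() + 2 * scores.std()

# Accuracy lives in [0, 1], so clip the endpoints to that range
lower, upper = np.clip([lower, upper], 0.0, 1.0)
print("95%% CI: [%0.3f, %0.3f]" % (lower, upper))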
Upvotes: 0
Reputation: 4221
Obviously, the accuracy itself cannot be larger than 1. The underlying assumption in this code is that the scores stored in scores are normally distributed; under that assumption, the 95% confidence interval is given by mean +/- 2*std.
This gives sensible results most of the time, but in your case the interval is ill-defined because the mean accuracy is already so close to 1. I know this is not a great solution, but maybe you can reduce your confidence interval to 68%? Then you would just drop the factor of 2 in front of the std, and the upper bound would be 0.98 + 0.015 = 0.995, i.e. 99.5%.
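As a sketch of that narrower 68% interval (same iris/SVC setup as in the question):

from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)

# 68% interval under the normal assumption: mean +/- 1*std
print("Accuracy: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std()))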
Upvotes: 2