user11041755

Reputation: 61

Scikit-learn : Cross validation and Confidence Intervals

I'm trying to calculate the confidence interval for my classification model using DecisionTreeClassifier in scikit-learn.

Reading the scikit-learn documentation about cross validation and confidence intervals (https://scikit-learn.org/dev/modules/cross_validation.html), I found the code below, and it seemed straightforward; however, I don't understand why the upper limit of the interval is greater than 1. How can the accuracy be higher than 100%?

from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()  # example dataset used in the docs
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.98 (+/- 0.03)
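
For concreteness, the printed +/- value is already 2 * std, so the implied 95% interval can be computed explicitly (this reuses the scores array from the snippet above):

# the reported +/- value is 2 * std, so the interval is mean +/- 2 * std
lower = scores.mean() - 2 * scores.std()
upper = scores.mean() + 2 * scores.std()
print("95%% interval: [%0.3f, %0.3f]" % (lower, upper))
# with mean = 0.98 and 2 * std = 0.03 this prints roughly [0.950, 1.010]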

Upvotes: 6

Views: 3802

Answers (2)

kelkka

Reputation: 1004

As mentioned here, it may be a better idea to simply clip the confidence interval to the valid [0, 1] range.
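
A minimal sketch of that idea, assuming the scores array from the question (clipping is a pragmatic fix rather than a statistically principled one):

import numpy as np

# naive normal-approximation interval, clipped to the valid accuracy range [0, 1]
mean, half_width = scores.mean(), 2 * scores.std()
lower, upper = np.clip([mean - half_width, mean + half_width], 0.0, 1.0)
print("Clipped 95%% interval: [%0.3f, %0.3f]" % (lower, upper))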

Upvotes: 0

MaximeKan

Reputation: 4221

Obviously, the accuracy itself cannot be larger than 1.

The underlying assumption in this code is that the scores computed by cross_val_score are normally distributed. Under that assumption, the 95% confidence interval is given by mean ± 2 * std.

This gives sensible results most of the time, but in your case it is ill-defined, because the mean accuracy is already so close to 1. I know this is not a great solution, but you could reduce your confidence level to 68%: just remove the factor of 2 in front of the std, and the upper bound becomes 0.98 + 0.015 = 0.995, i.e. 99.5%.
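
A quick sketch of that adjustment, again assuming the scores array from the question:

# 68% interval under the same normality assumption: mean +/- 1 * std
print("Accuracy: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std()))
# with mean = 0.98 and std = 0.015, the upper bound is 0.98 + 0.015 = 0.995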

Upvotes: 2
