Reputation: 43
Here's the code:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

xtrain, xtest, ytrain, ytest = train_test_split(xx, yy, test_size=0.50)
clf = MultinomialNB(alpha=1.0)
clf.fit(xtrain, ytrain)
predictions = clf.predict(xtest)
print('score:', metrics.accuracy_score(ytest, predictions))
Standard stuff, but here's the problem. The score, as you can see below, is impossibly high. The actual outcome (not showing the code for that, but it's just basic reporting of predictions vs. the Y column) is that 3621 rows were predicted to be in the class. Of those, only 299 actually were (true positives). Nothing at all like 99% accuracy.
score: 0.9942950664902702
num rows: 644004
Y == 1: 651
picked: 3621 | true positives: 299 | false positives: 3322
I didn't want to tag this as specific to MultinomialNB, because I found that RandomForestClassifier gives the same result. The problem (or the problem with me) appears to be related to the scoring function itself.
Upvotes: 0
Views: 2030
Reputation: 60400
This sounds like a textbook example of why accuracy is not meaningful for heavily imbalanced datasets.
That your (test) dataset is heavily imbalanced is clear from the aggregate statistics you have provided: out of 644004 samples, only 651 belong to the positive class, or just 0.1% (and I bet that the composition of your training set is similar).
Under such circumstances, it is easy to show that the accuracy you get is indeed realistic (albeit meaningless); from the definition of accuracy:
acc = (correctly classified samples) / (total samples)
    = (total samples - FP - FN) / (total samples)
Ignoring the false negatives (FN), for which you do not provide any info, we get:
(644004 - 3322)/644004
# 0.9948416469462923
which, as expected, is only slightly higher than your reported accuracy (since I have not accounted for the false negatives -FN- which you also certainly get), but still in the 99% range. The bottom line is: your accuracy is correct, but useless (i.e. it doesn't tell you anything useful about your model).
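A quick sanity check of the arithmetic above, plugging in the counts from your own output (FN ignored, as in the derivation, so this is an upper bound on your accuracy):

```python
# Counts reported in the question
total = 644004   # num rows
fp = 3322        # false positives

# Upper bound on accuracy when false negatives are ignored
acc_upper = (total - fp) / total
print(acc_upper)  # just above the reported score of 0.9942950664902702
```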
You should start googling "class imbalance", which is a separate (and huge) sub-topic with its own peculiarities. Intuitively speaking, accuracy is meaningless here because, as demonstrated clearly by your own data, a classifier trained on data where the positive class (which is usually the class of interest) makes up only ~0.1% of all samples can report 99.9% accuracy by simply classifying every sample as belonging to the negative class (which is not exactly what has happened here, but hopefully you get the idea). Special methods, and different metrics (precision, recall, F1-score etc.), are applicable to imbalanced datasets.
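In fact, the counts you already quote are enough to compute those more informative metrics by hand. A minimal sketch, assuming every positive sample that was not picked counts as a false negative:

```python
# Counts reported in the question
tp = 299                 # true positives
fp = 3322                # false positives
positives = 651          # actual Y == 1 samples

# Assumption: any positive not picked by the classifier is a false negative
fn = positives - tp

precision = tp / (tp + fp)   # ~0.083: over 90% of the picks are wrong
recall = tp / (tp + fn)      # ~0.46: fewer than half the positives are found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

These numbers tell the real story that the 99.4% accuracy hides.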
Upvotes: 2