ford prefect

Reputation: 7388

Biasing scikit-learn's MultinomialNB toward positives

I am trying to run multinomial naive Bayes on a series of examples in Python using scikit-learn. I consistently get every example classified as negative. The training set is somewhat biased toward negatives, with P(negative) ≈ 0.75. I looked through the documentation and couldn't find a way to bias the classifier toward positives.

from sklearn.datasets import load_svmlight_file
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Load the training and validation sets (svmlight format).
X_train, y_train = load_svmlight_file("POS.train")
X_test, y_test = load_svmlight_file("POS.val")

# Fit the classifier and evaluate on the validation set.
clf = MultinomialNB()
clf.fit(X_train, y_train)
preds = clf.predict(X_test)

print('accuracy: ' + str(accuracy_score(y_test, preds)))
print('precision: ' + str(precision_score(y_test, preds)))
print('recall: ' + str(recall_score(y_test, preds)))

Upvotes: 1

Views: 431

Answers (1)

AN6U5

Reputation: 2885

Setting a prior is a poor way to handle this and will cause genuinely negative cases to be misclassified as positive. Your data has a 0.25/0.75 class split, so forcing a 0.5/0.5 prior is a bad match for it.
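For reference, the knob being discussed does exist: MultinomialNB accepts a class_prior argument, given in the order of the sorted class labels (clf.classes_). A minimal sketch of the approach cautioned against above, reusing the variables from the question's snippet:

from sklearn.naive_bayes import MultinomialNB

# Force a uniform prior instead of the empirical 0.25/0.75 one.
# Expect more positives, including false positives, as noted above.
clf = MultinomialNB(class_prior=[0.5, 0.5])
clf.fit(X_train, y_train)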

Instead, you can average precision and recall with a harmonic mean to produce an F score, which is designed to give a fairer picture on imbalanced data like this:

from sklearn.metrics import f1_score

The F1 score can then be used to assess the quality of the model. You can then do some model tuning and cross-validation to find the model that best classifies your data, i.e., the one that maximizes the F1 score, as sketched below.
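One way to do that tuning, as a sketch that reuses the variables from the question's snippet; the grid over the smoothing parameter alpha is hypothetical, and scoring='f1' assumes the positive class is labeled 1:

from sklearn.model_selection import GridSearchCV

# Cross-validated search that selects the alpha maximizing F1.
param_grid = {'alpha': [0.01, 0.1, 0.5, 1.0, 2.0]}
search = GridSearchCV(MultinomialNB(), param_grid, scoring='f1', cv=5)
search.fit(X_train, y_train)

print('best params: ' + str(search.best_params_))
print('validation F1: ' + str(f1_score(y_test, search.predict(X_test))))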

Another option is to randomly undersample the negative cases in your training data so that the classifier is trained on a 0.5/0.5 split, for example as sketched below. The predict step should then give more balanced classifications.
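A minimal sketch of that undersampling, again reusing the question's variables and assuming the positive class is labeled 1:

import numpy as np

# Keep all positives and an equally sized random subset of negatives.
rng = np.random.RandomState(0)
pos = np.where(y_train == 1)[0]
neg = rng.choice(np.where(y_train != 1)[0], size=len(pos), replace=False)
keep = np.concatenate([pos, neg])

# Retrain on the balanced subset (sparse row indexing works on CSR).
clf = MultinomialNB()
clf.fit(X_train[keep], y_train[keep])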

Upvotes: 1
