Reputation: 99
I started using sklearn.naive_bayes.GaussianNB for text classification, and have been getting fine initial results. I want to use the probability returned by the classifier as a measure of confidence, but the predict_proba() method always returns "1.0" for the chosen class, and "0.0" for all the rest.
I know (from here) that "...the probability outputs from predict_proba are not to be taken too seriously", but to that extent?! The classifier can mistake finance-investing or chords-strings, but the predict_proba() output shows no sign of hesitation...
A little about the context:
- I've been using sklearn.feature_extraction.text.TfidfVectorizer for feature extraction without, for a start, restricting the vocabulary with stop_words or min_df/max_df, so I have been getting very large vectors.
- I've been training the classifier on a hierarchical category tree (shallow: not more than 3 layers deep) with 7 texts (manually categorized) per category. It is, for now, flat training: I am not taking the hierarchy into account.
The resulting GaussianNB object is very big (~300MB), and prediction is rather slow: around 1 second for one text.
Can this be related? Are the huge vectors at the root of all this?
How do I get meaningful predictions? Do I need to use a different classifier?
Here's the code I'm using:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
import numpy as np
import joblib  # standalone package; sklearn.externals.joblib is gone from recent scikit-learn releases
Vectorizer = TfidfVectorizer(input = 'content')
vecs = Vectorizer.fit_transform(TextsList) # ~2000 strings
joblib.dump(Vectorizer, 'Vectorizer.pkl')
gnb = GaussianNB()
Y = np.array(TargetList) # ~2000 categories
gnb.fit(vecs.toarray(), Y)
joblib.dump(gnb, 'Classifier.pkl')
...
#In a different function:
Vectorizer = joblib.load('Vectorizer.pkl')
Classifier = joblib.load('Classifier.pkl')
InputList = [Text] # One string
Vec = Vectorizer.transform(InputList)
Probs = Classifier.predict_proba([Vec.toarray()[0]])[0]
MaxProb = max(Probs)
MaxProbIndex = np.where(Probs==MaxProb)[0][0]
Category = Classifier.classes_[MaxProbIndex]
result = (Category, MaxProb)
Update:
Following the advice below, I tried MultinomialNB & LogisticRegression. They both return varying probabilities, and are better in every way for my task: much more accurate classification, smaller objects in memory & much better speed (MultinomialNB is lightning fast!).
I now have a new problem: the returned probabilities are very small, typically in the range 0.004-0.012. This is for the predicted/winning category (and the classification is accurate).
Upvotes: 8
Views: 6234
Reputation: 363807
"...the probability outputs from predict_proba are not to be taken too seriously"
I'm the guy who wrote that. The point is that naive Bayes tends to predict probabilities that are almost always either very close to zero or very close to one; exactly the behavior you observe. Logistic regression (sklearn.linear_model.LogisticRegression or sklearn.linear_model.SGDClassifier(loss="log")) produces more realistic probabilities.
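As a rough illustration of the point about logistic regression, here is a minimal sketch on a made-up six-document corpus (the texts, labels and query string are all hypothetical, standing in for TextsList and TargetList from the question). It assumes a reasonably recent scikit-learn and only shows that predict_proba gives graded values rather than hard 0.0/1.0 outputs; it says nothing about how well calibrated they would be on real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus; the real inputs would be TextsList / TargetList.
texts = [
    "stocks bonds and interest rates",
    "bank loans and interest payments",
    "investing in stocks and funds",
    "guitar chords and strings",
    "piano chords and scales",
    "violin strings and bow technique",
]
labels = ["finance", "finance", "finance", "music", "music", "music"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)  # stays a scipy sparse matrix, no toarray()

clf = LogisticRegression(max_iter=1000)
clf.fit(X, labels)

# A deliberately ambiguous query that mixes words from both categories.
query = vectorizer.transform(["interest in guitar strings"])
probs = clf.predict_proba(query)[0]
for category, p in zip(clf.classes_, probs):
    print(category, round(float(p), 3))
```

On an ambiguous input like this one, both classes should receive a noticeable share of the probability mass, which is the kind of hesitation the question was looking for.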
"The resulting GaussianNB object is very big (~300MB), and prediction is rather slow: around 1 second for one text."
That's because GaussianNB is a non-linear model and does not support sparse matrices (which you found out already, since you're using toarray). Use MultinomialNB, BernoulliNB or logistic regression, which are much faster at predict time and also smaller. Their assumptions wrt. the input are also more realistic for term features. GaussianNB is really not a good estimator for text classification.
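To make the sparse-matrix point concrete, here is a small sketch of the same pipeline with MultinomialNB in place of GaussianNB. The corpus and the query string are invented for illustration; the only change relative to the code in the question is that the TF-IDF matrix is passed to fit and predict_proba as-is, without toarray, and that the winning class is picked with np.argmax.

```python
import numpy as np
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-ins for TextsList / TargetList from the question.
texts = [
    "stocks and bonds",
    "loans and interest rates",
    "guitar chords and strings",
    "piano scales and chords",
]
labels = ["finance", "finance", "music", "music"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)  # scipy sparse matrix

mnb = MultinomialNB()
mnb.fit(X, labels)  # sparse input is accepted directly

# Predict on one new text, again without densifying the vector.
vec = vectorizer.transform(["new guitar strings"])
probs = mnb.predict_proba(vec)[0]
best = int(np.argmax(probs))
result = (mnb.classes_[best], float(probs[best]))
print(result)
```

Because nothing is converted to a dense array, memory use grows with the number of non-zero entries rather than with the full vocabulary size times the number of documents.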
Upvotes: 15