Reputation: 1815
I have a dataset of reviews with a positive/negative class label, and I am applying Naive Bayes to it. First, I convert the reviews into a bag-of-words representation. Here sorted_data['Text'] contains the reviews and final_counts is a sparse count matrix:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
final_counts = count_vect.fit_transform(sorted_data['Text'].values)
Then I split the data into train and test sets:
X_tr, X_test, y_tr, y_test = cross_validation.train_test_split(final_counts, labels, test_size=0.3, random_state=0)
I apply the Naive Bayes algorithm as follows:
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score

optimal_alpha = 1
NB_optimal = BernoulliNB(alpha=optimal_alpha)

# fitting the model
NB_optimal.fit(X_tr, y_tr)

# predict the response
pred = NB_optimal.predict(X_test)

# evaluate accuracy
acc = accuracy_score(y_test, pred) * 100
print('\nThe accuracy of the NB classifier for alpha = %d is %f%%' % (optimal_alpha, acc))
Here X_test is the test set, and pred tells us whether each vector in X_test was predicted as the positive or the negative class.
The shape of X_test is (54626, 82343), i.e. 54626 rows and 82343 dimensions, and the length of pred is 54626.
My question: I want to get the words with the highest probability in each vector, so that I can see from those words why a vector was predicted as the positive or the negative class. How do I get the words with the highest probability in each vector?
Upvotes: 17
Views: 35766
Reputation: 616
Unfortunately, I disagree with the accepted answer, since it outputs the conditional log probabilities. For example (this is what actually happened to me, and why I propose a different approach), say you do sentiment analysis with Naive Bayes and use feature_log_prob_ as in that answer. The feature_log_prob_ of the word 'the' is Prob(the | y==1); since the word 'the' is very likely to occur, it will be considered an 'important feature'. But bear in mind that Prob(the | y==0) is also very high (the word 'the' occurs no matter whether the sentiment is positive or negative). Therefore, if you did not remove stopwords, 'the' would come out as more important than the word 'WORSE' for predicting the negative class in a sentiment classifier. This is clearly unintuitive and wrong. It happened to me in a project.
In case someone points out that Prob(the | y==1) is still high: yes, the word 'the' helps to predict the positive class since Prob(the | y==1) is high, but it helps equally to predict the negative class, and therefore it has no net effect (it is not important).
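To make this concrete, here is a minimal sketch (assuming a fitted classifier nb and the count_vect vectorizer from the question; the word 'the' is just an illustrative example) that compares the two conditional log probabilities for a single word:
word = 'the'
idx = count_vect.vocabulary_[word]  # column index of the word in the BoW matrix

log_p_word_given_0 = nb.feature_log_prob_[0, idx]  # log P(word | class 0)
log_p_word_given_1 = nb.feature_log_prob_[1, idx]  # log P(word | class 1)

# For a stopword both values are high, so their difference (which is what
# actually matters for the prediction) is close to zero.
print(log_p_word_given_0, log_p_word_given_1, log_p_word_given_1 - log_p_word_given_0)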
Proposed solution:
I had the same trouble; maybe this belongs on the Data Science Stack Exchange, but I want to post it here since it gave me very good results.
First:
We are going to build an odds ratio, which can be shown to be equal to P(word_i, +) / P(word_i, -) (let me know if you need the derivation). If this ratio is greater than 1, it means that word_i is more likely to occur in positive texts than in negative texts.
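For completeness, here is a quick sketch of why that identity holds (P(+) and P(-) are the class priors computed just below):
P(w_i | +) / P(w_i | -) * P(+) / P(-) = P(w_i, +) / P(w_i, -) = P(+ | w_i) / P(- | w_i)
The last equality follows from dividing numerator and denominator by P(w_i), so the ratio can also be read as the posterior odds of the two classes given the word.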
These are the class priors in the Naive Bayes model:
# remember: normalize=True outputs proportions
# (this assumes the labels are encoded so that 0 is the positive class and
#  1 is the negative class, matching the row order of nb.classes_ used below)
prob_pos = df_train['y'].value_counts(normalize=True)[0]
prob_neg = df_train['y'].value_counts(normalize=True)[1]
Create a DataFrame to store the per-word probabilities:
import numpy as np
import pandas as pd

df_nbf = pd.DataFrame()
df_nbf.index = count_vect.get_feature_names()

# Convert log probabilities to probabilities
# (row 0 of feature_log_prob_ is the positive class here, row 1 the negative class).
df_nbf['pos'] = np.exp(nb.feature_log_prob_[0, :])
df_nbf['neg'] = np.exp(nb.feature_log_prob_[1, :])

df_nbf['odds_positive'] = (df_nbf['pos'] / df_nbf['neg']) * (prob_pos / prob_neg)
df_nbf['odds_negative'] = (df_nbf['neg'] / df_nbf['pos']) * (prob_neg / prob_pos)
Most important words. This gives you a ratio > 1. For example, an odds_negative = 2 for the word "damn" means that this word is twice as likely to occur when the comment (or your class) is negative compared with when it is positive.
# Here are the top 5 most important words of your positive class:
odds_pos_top5 = df_nbf.sort_values('odds_positive', ascending=False)['odds_positive'][:5]
# Here are the top 5 most important words of your negative class:
odds_neg_top5 = df_nbf.sort_values('odds_negative', ascending=False)['odds_negative'][:5]
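If you want to reuse this, a minimal sketch that wraps the steps above into one helper (the function name is mine; it assumes, as above, that row 0 of feature_log_prob_ is the positive class and row 1 the negative class; check nb.classes_ if unsure):
import numpy as np
import pandas as pd

def class_odds_table(nb, count_vect, prob_pos, prob_neg):
    """Hypothetical helper: build the odds-ratio table described above."""
    df = pd.DataFrame(index=count_vect.get_feature_names())
    # Row order of feature_log_prob_ follows nb.classes_; here 0 = positive, 1 = negative.
    df['pos'] = np.exp(nb.feature_log_prob_[0, :])
    df['neg'] = np.exp(nb.feature_log_prob_[1, :])
    df['odds_positive'] = (df['pos'] / df['neg']) * (prob_pos / prob_neg)
    df['odds_negative'] = (df['neg'] / df['pos']) * (prob_neg / prob_pos)
    return df

# Example usage with the objects defined above:
# df_nbf = class_odds_table(nb, count_vect, prob_pos, prob_neg)
# print(df_nbf.sort_values('odds_positive', ascending=False).head())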
Upvotes: 7
Reputation: 5355
You can get the importance of each word out of the fitted model by using the coef_ or feature_log_prob_ attributes. For example:
import numpy as np

# argsort gives ascending order, so reverse it to put the highest log probabilities first
neg_class_prob_sorted = NB_optimal.feature_log_prob_[0, :].argsort()[::-1]
pos_class_prob_sorted = NB_optimal.feature_log_prob_[1, :].argsort()[::-1]

print(np.take(count_vect.get_feature_names(), neg_class_prob_sorted[:10]))
print(np.take(count_vect.get_feature_names(), pos_class_prob_sorted[:10]))
This prints the 10 most predictive words for each of your classes.
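One caveat (a small check, using the same NB_optimal from the question): the row order of feature_log_prob_ follows NB_optimal.classes_, so it is worth confirming which row corresponds to the negative class and which to the positive class before reading the output above:
# classes_ lists the labels in the same order as the rows of feature_log_prob_,
# so classes_[0] is the class behind neg_class_prob_sorted here.
print(NB_optimal.classes_)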
Upvotes: 26
Reputation: 7631
def get_salient_words(nb_clf, vect, class_ind):
    """Return the salient words for a given class.

    Parameters
    ----------
    nb_clf : a Naive Bayes classifier (e.g. MultinomialNB, BernoulliNB)
    vect : CountVectorizer
    class_ind : int

    Returns
    -------
    list
        A list of (word, log prob) tuples sorted by log probability in descending order.
    """
    words = vect.get_feature_names()
    zipped = list(zip(words, nb_clf.feature_log_prob_[class_ind]))
    sorted_zip = sorted(zipped, key=lambda t: t[1], reverse=True)
    return sorted_zip
neg_salient_top_20 = get_salient_words(NB_optimal, count_vect, 0)[:20]
pos_salient_top_20 = get_salient_words(NB_optimal, count_vect, 1)[:20]
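If you prefer to see probabilities instead of log probabilities, a small usage sketch on top of the function above (numpy is only used for the exponentiation):
import numpy as np

# Print each salient word together with its conditional probability P(word | class).
for word, log_prob in pos_salient_top_20:
    print('%s: %.6f' % (word, np.exp(log_prob)))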
Upvotes: 4
Reputation: 210832
Try this:
pred_proba = NB_optimal.predict_proba(X_test)
words = np.take(count_vect.get_feature_names(), pred_proba.argmax(axis=1))
Upvotes: 2