Reputation: 139
I use Naive Bayes from Weka to do text classification. I have two classes for my sentences, "Positive" and "Negative". I collected about 207 sentences with positive meaning and 189 sentences with negative meaning, in order to create my training set.
When I run Naive Bayes on a test set of sentences with strong negative meaning, such as sentences containing the word "hate", the accuracy is pretty good, about 88%. But when I use a test set of sentences with positive meaning, such as sentences containing the word "love", the accuracy is much worse, about 56%.
I think that this difference probably has something to do with my training set and especially its "Positive" sentences.
Can you think of any reason that could explain this difference? Or maybe a way to help me find out where the problem begins?
Thanks a lot for your time,
Nantia
Upvotes: 1
Views: 1246
Reputation: 16114
To better understand how your classifier works, you can inspect its parameters to see which words it considers most predictive of a positive or negative sentence. Can you print out the top predictors for the positive and negative classes?
e.g.,
top positive predictors:
p('love'|positive) = 0.05
p('like'|positive) = 0.016
...
top negative predictors:
p('hate'|negative) = 0.25
p('dislike'|negative) = 0.17
...
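One way to produce such a listing outside Weka is sketched below in plain Python. It is only an illustration under assumptions: the toy sentences, the function names `train_word_probs` and `top_predictors`, and the Laplace smoothing constant `alpha` are all made up here, and the estimates will not match Weka's exactly. Words are ranked by the log-ratio of their smoothed class-conditional probabilities, so a word that is common in both classes does not dominate the list.

```python
import math
from collections import Counter

def train_word_probs(sentences, labels, alpha=1.0):
    """Estimate p(word | class) with Laplace smoothing from tokenized sentences."""
    counts = {}   # class label -> Counter of word occurrences
    vocab = set()
    for words, label in zip(sentences, labels):
        counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    probs = {}
    for label, counter in counts.items():
        total = sum(counter.values()) + alpha * len(vocab)
        probs[label] = {w: (counter[w] + alpha) / total for w in vocab}
    return probs

def top_predictors(probs, label, other, k=5):
    """Rank words by log(p(w | label) / p(w | other)), highest first."""
    ratio = {w: math.log(probs[label][w] / probs[other][w]) for w in probs[label]}
    return sorted(ratio, key=ratio.get, reverse=True)[:k]

# Toy data (hypothetical, stands in for the real training sentences).
sentences = [
    "i love this movie".split(),
    "i love the music".split(),
    "i hate this movie".split(),
    "i hate the ending".split(),
]
labels = ["positive", "positive", "negative", "negative"]

probs = train_word_probs(sentences, labels)
for label, other in (("positive", "negative"), ("negative", "positive")):
    print("top", label, "predictors:")
    for w in top_predictors(probs, label, other, k=3):
        print("  p(%r|%s) = %.3f" % (w, label, probs[label][w]))
```

If the top positive predictors turn out to be much weaker than the top negative ones, or to be words that appear in both classes, that would point at the "Positive" part of the training set.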
Upvotes: 1
Reputation: 19169
It may be that your negative sentences share a small set of words that are consistently present, whereas your positive sentences show more variation in their words, or contain words that also appear often in the negative sentences.
It is hard to give specific advice without knowing the size of your dictionary (i.e., number of attributes), size of your test set, etc. Since the Naive Bayes Classifier calculates the product of the probabilities of individual words being present or absent, I would take some of the misclassified positive examples and examine the conditional probabilities for both positive and negative classification to see why the examples are being misclassified.
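The inspection described above can be sketched as follows. This is a minimal illustration, not Weka's implementation: it assumes a Bernoulli (word present/absent) model with Laplace smoothing, and the toy sentences and the names `train_bernoulli_nb` and `log_odds_contributions` are hypothetical. For one sentence it lists how much each dictionary word pushes the decision toward "positive" (positive value) or "negative" (negative value), counting absent words as well as present ones.

```python
import math

def train_bernoulli_nb(docs, labels, alpha=1.0):
    """Estimate class priors and p(word present | class) with Laplace smoothing."""
    vocab = sorted({w for d in docs for w in d})
    prior, present = {}, {}
    for c in sorted(set(labels)):
        class_docs = [set(d) for d, y in zip(docs, labels) if y == c]
        prior[c] = len(class_docs) / len(docs)
        present[c] = {
            w: (sum(w in d for d in class_docs) + alpha) / (len(class_docs) + 2 * alpha)
            for w in vocab
        }
    return vocab, prior, present

def log_odds_contributions(words, vocab, present, pos="positive", neg="negative"):
    """Per-word log p(x_w | pos) - log p(x_w | neg), where x_w is presence or absence."""
    seen = set(words)
    contrib = {}
    for w in vocab:
        p_pos, p_neg = present[pos][w], present[neg][w]
        if w not in seen:
            # The word is absent, so use the probability of absence.
            p_pos, p_neg = 1.0 - p_pos, 1.0 - p_neg
        contrib[w] = math.log(p_pos) - math.log(p_neg)
    return contrib

# Toy training data (hypothetical).
docs = [
    "i love this movie".split(),
    "i really like this movie".split(),
    "i hate this movie".split(),
    "i hate this song".split(),
]
labels = ["positive", "positive", "negative", "negative"]
vocab, prior, present = train_bernoulli_nb(docs, labels)

# A positive sentence to examine.
sentence = "i like this song".split()
contrib = log_odds_contributions(sentence, vocab, present)
total = math.log(prior["positive"]) - math.log(prior["negative"]) + sum(contrib.values())

# Largest contributions first, in either direction.
for w in sorted(contrib, key=lambda w: abs(contrib[w]), reverse=True):
    print("%-8s %+.3f" % (w, contrib[w]))
print("total log-odds (positive vs. negative): %+.3f" % total)
```

A total below zero means the sentence is labelled "negative". Reading the per-word terms for a misclassified positive sentence shows whether the error comes from words it contains or from expected positive words it lacks.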
Upvotes: 1
Reputation: 6271
Instead of creating test sets that contain only positive or only negative samples, I would create a single test set with mixed samples. You can then view the resulting confusion matrix in Weka, which shows how well both the positive and the negative samples were classified. Furthermore, I would use (10-fold) cross-validation to get a more stable measure of performance. Once you have done this, you might want to edit your post to include the confusion matrix from the cross-validation results, and we might be able to help out more.
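To make the idea concrete, here is a rough sketch of k-fold cross-validation that accumulates a confusion matrix over mixed data. It is plain Python rather than Weka, and everything in it is an assumption for illustration: the toy sentences, the small multinomial Naive Bayes in `train_nb` and `predict_nb`, and the simple unstratified fold split, which may differ from how Weka builds its folds.

```python
import math
import random
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Tiny multinomial Naive Bayes with Laplace smoothing (illustrative only)."""
    vocab = {w for d in docs for w in d}
    model = {}
    for c in set(labels):
        counts = Counter(w for d, y in zip(docs, labels) if y == c for w in d)
        total = sum(counts.values()) + alpha * len(vocab)
        model[c] = {
            "log_prior": math.log(labels.count(c) / len(labels)),
            "log_prob": {w: math.log((counts[w] + alpha) / total) for w in vocab},
        }
    return model

def predict_nb(model, doc):
    """Return the class with the highest log score; unseen words are skipped."""
    def score(c):
        log_prob = model[c]["log_prob"]
        return model[c]["log_prior"] + sum(log_prob[w] for w in doc if w in log_prob)
    return max(sorted(model), key=score)

def cross_validated_confusion(docs, labels, k=10, seed=0):
    """Accumulate (actual, predicted) counts over k unstratified folds."""
    idx = list(range(len(docs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    matrix = Counter()
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = train_nb([docs[i] for i in train], [labels[i] for i in train])
        for i in fold:
            matrix[(labels[i], predict_nb(model, docs[i]))] += 1
    return matrix

def recall(matrix, c):
    """Fraction of samples whose actual class is c that were predicted as c."""
    total = sum(n for (actual, _), n in matrix.items() if actual == c)
    return matrix[(c, c)] / total if total else float("nan")

# Toy mixed data (hypothetical); k is kept small because there are only 8 samples.
docs = [s.split() for s in (
    "i love this movie", "i love the music", "what a lovely day", "i really like it",
    "i hate this movie", "i hate the ending", "what an awful day", "i really dislike it",
)]
labels = ["positive"] * 4 + ["negative"] * 4

matrix = cross_validated_confusion(docs, labels, k=4)
for actual in ("positive", "negative"):
    row = [matrix[(actual, predicted)] for predicted in ("positive", "negative")]
    print(actual, row, "recall = %.2f" % recall(matrix, actual))
```

The per-class recall from the matrix corresponds to the two separate accuracies reported in the question, but measured on the same folds, which makes the two numbers directly comparable.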
Upvotes: 1