Reputation: 23
I'm doing a project on document classification using a naive Bayes classifier in Python, via the NLTK module. The documents come from the Reuters dataset. I performed preprocessing steps such as stemming and stopword elimination, then computed the tf-idf of the index terms and used those values to train the classifier, but the accuracy is very poor (53%). What should I do to improve the accuracy?
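Roughly, the preprocessing looks like the following sketch (the Reuters corpus bundled with NLTK and the helper name `preprocess` are just for illustration):

```python
# Requires nltk.download('reuters'), nltk.download('stopwords') beforehand.
from nltk.corpus import reuters, stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop = set(stopwords.words('english'))

def preprocess(fileid):
    # Lowercase, drop stopwords and non-alphabetic tokens, then stem.
    tokens = [w.lower() for w in reuters.words(fileid)]
    return [stemmer.stem(w) for w in tokens if w.isalpha() and w not in stop]

print(preprocess(reuters.fileids()[0])[:10])
```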
Upvotes: 2
Views: 2966
Reputation: 380
As Maus was saying, NLTK's Naive Bayes (NB) classifier uses a Bernoulli model plus smoothing to handle feature conditional probabilities of zero (for features not seen by the classifier in training). A common smoothing technique is Laplace smoothing, where you add 1 to the numerator of the conditional probability, but I believe NLTK adds 0.5 to the numerator. The NLTK NB model uses boolean values and computes its conditionals based on those, so using tf-idf as a feature will not produce good, or even meaningful, results.
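For concreteness, add-α smoothing estimates a feature's conditional probability as (a generic form; B is the number of values the feature can take, e.g. B = 2 for boolean features):

```latex
P(f = v \mid c) = \frac{\mathrm{count}(f = v,\, c) + \alpha}{\mathrm{count}(c) + \alpha B}
```

With α = 1 this is Laplace smoothing; α = 0.5 corresponds to NLTK's default expected-likelihood estimate, which is the 0.5 mentioned above.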
If you want to stay within NLTK, you should use the words themselves (and bigrams) as features. Check out this article by Jacob Perkins on text processing with NB in NLTK: http://streamhacker.com/tag/information-gain/. It does a great job explaining and demonstrating some of the things you can do to preprocess your data, using the movie reviews corpus from NLTK for sentiment classification.
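A minimal sketch in that spirit (the labeled `documents` list of (tokens, label) pairs is assumed to exist):

```python
import nltk
from nltk.util import bigrams

def bag_of_words_features(tokens):
    # Boolean presence features: single words plus bigram tuples.
    feats = {word: True for word in tokens}
    feats.update({bg: True for bg in bigrams(tokens)})
    return feats

train = [(bag_of_words_features(tokens), label) for tokens, label in documents]
classifier = nltk.NaiveBayesClassifier.train(train)
classifier.show_most_informative_features(10)
```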
There is another Python module for text processing called scikit-learn, which includes various NB models such as Multinomial NB; this model uses the frequency of each word, rather than its mere occurrence, to compute the conditional probabilities.
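A minimal sketch, assuming `train_texts`/`test_texts` and their labels are your raw Reuters documents:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

vectorizer = CountVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(train_texts)   # term-frequency matrix

clf = MultinomialNB()                             # Laplace smoothing by default
clf.fit(X_train, train_labels)

X_test = vectorizer.transform(test_texts)
print(clf.score(X_test, test_labels))             # accuracy
```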
Here is some literature on NB and on how both the multinomial and Bernoulli models work: http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html; navigate through it using the previous/next buttons on the page.
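In short, using the notation of that chapter: the multinomial model multiplies the conditional probabilities of the terms that occur in document d, while the Bernoulli model runs over the whole vocabulary V and also penalizes terms that are absent:

```latex
% Multinomial NB
P(c \mid d) \propto P(c) \prod_{1 \le k \le n_d} P(t_k \mid c)

% Bernoulli NB, with e_t = 1 if term t occurs in d and 0 otherwise
P(c \mid d) \propto P(c) \prod_{t \in V} \big( e_t P(t \mid c) + (1 - e_t)(1 - P(t \mid c)) \big)
```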
Upvotes: 0
Reputation: 1639
A few points that might help:
You may also find that alternative weighting schemes, such as log(1 + TF) * log(IDF), improve accuracy.
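As a rough sketch of that weighting (per-document term counts `tf_counts`, document frequencies `df`, and corpus size `N` are assumed to be precomputed):

```python
import math

def log_tf_idf(tf_counts, df, N):
    # Sublinear tf-idf: log(1 + tf) * log(idf), where idf = N / df.
    return {term: math.log(1 + tf) * math.log(N / df[term])
            for term, tf in tf_counts.items()}
```

Good luck!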
Upvotes: 1
Reputation: 1843
If you are using the NLTK naive Bayes classifier, it's likely you're actually doing smoothed multivariate Bernoulli naive Bayes text classification. This can be a problem if your feature extraction function maps into the set of all floating-point values (which it sounds like it does, since you're using tf-idf) rather than the set of all boolean values.
If your feature extractor returns tf-idf values, then I think nltk.NaiveBayesClassifier will check whether
tf-idf(word1_in_doc1) == tf-idf(word1_in_class1)
rather than asking the appropriate question for whatever continuous distribution suits tf-idf.
This could explain your low accuracy, especially if one category occurs 53% of the time in your training set.
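To make that concrete, here is a hypothetical toy example (the feature values are made up): the classifier treats each distinct float as an unrelated discrete outcome, so an unseen tf-idf value conveys almost no information.

```python
import nltk

# Each float feature value is treated as a separate discrete label,
# not as a point on a continuum.
train = [({'word1': 0.231}, 'acq'),
         ({'word1': 0.519}, 'earn')]
clf = nltk.NaiveBayesClassifier.train(train)

# 0.3 is close to 0.231, but was never seen in training, so the classifier
# can only fall back on smoothed counts and the class priors.
print(clf.classify({'word1': 0.3}))
```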
You might want to check out the multinomial naive bayes classifier implemented in scikit-learn.
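If you would rather keep tf-idf weighting, scikit-learn's MultinomialNB also accepts the fractional "counts" that tf-idf produces; a rough sketch (`train_texts`/`train_labels` assumed):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

vec = TfidfVectorizer(stop_words='english')
clf = MultinomialNB().fit(vec.fit_transform(train_texts), train_labels)
print(clf.predict(vec.transform(["oil prices rose sharply today"])))
```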
For more information on multinomial and multivariate Bernoulli classifiers, see this very readable paper.
Upvotes: 0
Reputation: 7380
There could be many reasons for the classifier not working, and there are many ways to tweak it.
So the question is rather broad. Maybe if you give more details, you could get more relevant suggestions.
Upvotes: 0