Reputation: 69
I wrote a simple document classifier and I am currently testing it on the Brown Corpus. However, my accuracy is still very low (0.16). I've already excluded stopwords. Any other ideas on how to improve the classifier's performance?
import nltk, random
from nltk.corpus import brown, stopwords

documents = [(list(brown.words(fileid)), category)
             for category in brown.categories()
             for fileid in brown.fileids(category)]
random.shuffle(documents)

stop = set(stopwords.words('english'))
all_words = nltk.FreqDist(w.lower() for w in brown.words() if w in stop)
word_features = list(all_words.keys())[:3000]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
Upvotes: 0
Views: 1573
Reputation: 43
I would start by changing the first comment from:
import corpus documents = [(list(brown.words(fileid)), category)
to:
documents = [(list(brown.words(fileid)), category) ...
Also change w.lower to w.lower(), as the other answer says.
With those changes, and following the two tutorials below, which implement a basic Naive Bayes classifier without removing stop words, I got an accuracy of 33%, which is a lot higher than 16%:
https://pythonprogramming.net/words-as-features-nltk-tutorial/
https://pythonprogramming.net/naive-bayes-classifier-nltk-tutorial/?completed=/words-as-features-nltk-tutorial/
There are lots of things you can try to see whether they improve your accuracy:
1- removing stop words
2- removing punctuation
3- removing the most common words and the least common words
4- normalizing the text
5- stemming or lemmatizing the text
6- I think this feature-set gives True if the word is present and False if it is not present. You can implement a count or a frequency.
7- You can use unigrams, bigrams and trigrams or combinations of those.
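A few of these ideas can be sketched in plain Python. Treat this as a rough sketch, not a tuned solution: the helper names (`normalize`, `bigrams`, `build_vocab`, `count_features`), the tiny `STOP` set and the two toy documents are all made up for illustration. In your script you would plug in `stopwords.words('english')` and the Brown documents instead. It covers lowercasing, dropping stop words and punctuation, picking the vocabulary by frequency, adding bigrams, and using counts rather than True/False.

```python
from collections import Counter
import string

# Hypothetical stand-in for stopwords.words('english'); kept tiny so the
# sketch runs without downloading any NLTK data.
STOP = {"the", "a", "of", "and", "to", "in", "is"}

def normalize(tokens):
    """Lowercase tokens, then drop stop words and pure-punctuation tokens."""
    out = []
    for tok in tokens:
        tok = tok.lower()
        if tok in STOP:
            continue
        if all(ch in string.punctuation for ch in tok):
            continue
        out.append(tok)
    return out

def bigrams(tokens):
    """Join each pair of adjacent tokens into a 'w1 w2' string."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def build_vocab(docs, size):
    """Pick the `size` most frequent unigrams and bigrams over all documents."""
    counts = Counter()
    for tokens in docs:
        words = normalize(tokens)
        counts.update(words)
        counts.update(bigrams(words))
    return [term for term, _ in counts.most_common(size)]

def count_features(tokens, vocab):
    """Map each vocabulary term to its count in the document (not True/False)."""
    words = normalize(tokens)
    seen = Counter(words)
    seen.update(bigrams(words))
    return {f"count({term})": seen[term] for term in vocab}

# Toy documents, invented for this example only.
docs = [
    ["The", "senate", "passed", "the", "budget", ",", "and", "the",
     "senate", "adjourned", "."],
    ["A", "ghost", "walked", "in", "the", "garden", "."],
]
vocab = build_vocab(docs, size=50)
features = count_features(docs[0], vocab)
print(features["count(senate)"])
```

The dicts have the same shape as the ones `document_features` returns, so they should drop into `featuresets` the same way. One caveat: as far as I know, NLTK's Naive Bayes treats every distinct feature value as its own category, so raw counts may work better if you bucket them (for example 0, 1, "2 or more").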
Hope that helps.
Upvotes: 0
Reputation: 50220
If that's really your code, it's a wonder you get anything at all. w.lower is not a string, it's a function (method) object. You need to add the parentheses:
>>> w = "The"
>>> w.lower
<built-in method lower of str object at 0x10231e8b8>
>>> w.lower()
'the'
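To see why that matters for the frequency table, here is a small sketch using `collections.Counter` as a stand-in for `nltk.FreqDist` (that substitution and the toy word list are my assumptions, not your actual data). Without the parentheses the keys are method objects, so no string can ever match them:

```python
from collections import Counter

# Toy word list, invented for this example.
words = ["The", "cat", "saw", "the", "dog"]

# Without parentheses: every key is a bound method object, not a string.
broken = Counter(w.lower for w in words)

# With parentheses: keys are lowercased strings, and counts merge as expected.
fixed = Counter(w.lower() for w in words)

print(all(callable(key) for key in broken))  # keys are methods
print("the" in broken)                       # a string never matches a method key
print(fixed["the"])                          # "The" and "the" counted together
```

With keys like those in `broken`, a test such as `word in document_words` can never be true, so every feature would come out False.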
(But who knows, really. You need to fix the code in your question; it's full of cut-and-paste errors and who knows what else. Next time, help us help you better.)
Upvotes: 2