Reputation: 175
I'm going to train a classifier on a sample dataset using n-gram features. I searched for related material and wrote the code below. As I'm a beginner in Python, I have two questions.
1- Why should the dictionary have this 'True' structure (marked with a comment)? Is this related to the Naive Bayes classifier's input?
2- Which classifier do you recommend for this task?
Any other suggestions to shorten the code are welcome :).
from nltk.corpus import movie_reviews
from nltk.corpus import stopwords
from nltk import ngrams
from nltk.classify import NaiveBayesClassifier
import nltk.classify.util

stoplist = set(stopwords.words("english"))

def stopword_removal(words):
    useful_words = [word for word in words if word not in stoplist]
    return useful_words

def create_ngram_features(words, n):
    ngram_vocab = ngrams(words, n)
    my_dict = dict([(ng, True) for ng in ngram_vocab])  # HERE
    return my_dict

for n in [1, 2]:
    positive_data = []
    for fileid in movie_reviews.fileids('pos'):
        words = stopword_removal(movie_reviews.words(fileid))
        positive_data.append((create_ngram_features(words, n), "positive"))
    print('\n\n---------- Positive Data Sample ----------\n', positive_data[0])

    negative_data = []
    for fileid in movie_reviews.fileids('neg'):
        words = stopword_removal(movie_reviews.words(fileid))
        negative_data.append((create_ngram_features(words, n), "negative"))
    print('\n\n---------- Negative Data Sample ----------\n', negative_data[0])

    train_set = positive_data[:100] + negative_data[:100]
    test_set = positive_data[100:] + negative_data[100:]

    classifier = NaiveBayesClassifier.train(train_set)
    accuracy = nltk.classify.util.accuracy(classifier, test_set)
    print('\n', str(n) + '-gram accuracy:', accuracy)
Upvotes: 0
Views: 590
Reputation: 366
Before training, you need to transform your n-grams into a matrix of codes of size <number_of_documents, max_document_representation_length>. For example, a common document representation is a bag-of-words, where each word/n-gram from the corpus dictionary is mapped to its frequency in the document.
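As a rough, minimal sketch of that kind of count-based bag-of-n-grams representation, here is one way to build it with scikit-learn's CountVectorizer; the two toy documents are made up purely for illustration, and get_feature_names_out assumes scikit-learn >= 1.0:

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "this movie was great great fun",
    "this movie was boring",
]

# ngram_range=(1, 2) extracts both unigrams and bigrams; English stop words are dropped.
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english")
X = vectorizer.fit_transform(docs)          # shape: (number_of_documents, vocabulary_size)

print(vectorizer.get_feature_names_out())   # the n-gram vocabulary
print(X.toarray())                          # each cell holds an n-gram's frequency in a document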
The Naive Bayes classifier is the simplest classifier, but it performs poorly on noisy data and needs a balanced class distribution for training. You can try a boosting classifier, for example a gradient boosting machine, or a support vector machine.
All of these classifiers and transformers are available in the scikit-learn library.
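As a concrete starting point, here is a sketch of how the same movie_reviews documents could be fed to a linear SVM and a gradient boosting classifier from scikit-learn; the train/test split, the max_features cap, and the default parameters are illustrative choices, not tuned recommendations:

from nltk.corpus import movie_reviews
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Rebuild each review as a raw string so CountVectorizer can tokenize it.
texts, labels = [], []
for label in ("pos", "neg"):
    for fileid in movie_reviews.fileids(label):
        texts.append(movie_reviews.raw(fileid))
        labels.append(label)

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels)

for name, clf in [("Linear SVM", LinearSVC()),
                  ("Gradient boosting", GradientBoostingClassifier())]:
    # Unigrams + bigrams with English stop words removed, as in your code;
    # max_features keeps the matrix small enough for the boosting model.
    model = make_pipeline(
        CountVectorizer(ngram_range=(1, 2), stop_words="english", max_features=5000),
        clf)
    model.fit(X_train, y_train)
    print(name, "accuracy:", accuracy_score(y_test, model.predict(X_test)))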
Upvotes: 1