ECub Devs

Reputation: 175

How to work with n-grams for classification tasks?

I'm going to train a classifier on a sample dataset using n-grams. I searched for related content and wrote the code below. As I'm a beginner in Python, I have two questions.

1- Why does the dictionary need this 'True' structure (marked with a comment)? Is this related to the Naive Bayes classifier's input format?

2- Which classifier do you recommend to do this task?

Any other suggestions to shorten the code are welcome :).

from nltk.corpus import movie_reviews
from nltk.corpus import stopwords
from nltk import ngrams
from nltk.classify import NaiveBayesClassifier
import nltk.classify.util


stoplist = set(stopwords.words("english"))


def stopword_removal(words):
    useful_words = [word for word in words if word not in stoplist]
    return useful_words


def create_ngram_features(words, n):
    ngram_vocab = ngrams(words, n)
    my_dict = dict([(ng, True) for ng in ngram_vocab])  # HERE
    return my_dict


for n in [1,2]:
    positive_data = []
    for fileid in movie_reviews.fileids('pos'):
        words = stopword_removal(movie_reviews.words(fileid))
        positive_data.append((create_ngram_features(words, n), "positive"))
    print('\n\n---------- Positive Data Sample----------\n', positive_data[0])

    negative_data = []
    for fileid in movie_reviews.fileids('neg'):
        words = stopword_removal(movie_reviews.words(fileid))
        negative_data.append((create_ngram_features(words, n), "negative"))
    print('\n\n---------- Negative Data Sample ----------\n', negative_data[0])

    train_set = positive_data[:100] + negative_data[:100]
    test_set = positive_data[100:] + negative_data[100:]

    classifier = NaiveBayesClassifier.train(train_set)

    accuracy = nltk.classify.util.accuracy(classifier, test_set)
    print('\n', str(n)+'-gram accuracy:', accuracy)

Upvotes: 0

Views: 590

Answers (1)

roddar92

Reputation: 366

Before training, you need to transform your n-grams into a matrix of size <number_of_documents, vocabulary_size>. For example, a common document representation is a bag-of-words, where each word/n-gram in the corpus dictionary is mapped to its frequency in a document.

Naive Bayes is the simplest classifier, but it performs badly on noisy data and needs a balanced class distribution for training. You can try a stronger classifier instead, for example a gradient boosting machine or a support vector machine.

All of these classifiers and transformers are available in the scikit-learn library.
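A minimal sketch of that setup, assuming scikit-learn's `Pipeline` to chain the vectorizer and classifier (the tiny review snippets here are made-up stand-ins for the movie_reviews corpus; `LinearSVC` stands in for the suggested SVM):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_docs = [
    "a truly great film",
    "great acting and a great plot",
    "a complete waste of time",
    "boring and badly acted",
]
train_labels = ["positive", "positive", "negative", "negative"]

# Swap classifiers freely: the vectorizer part of the pipeline stays the same
for clf in (MultinomialNB(), LinearSVC()):
    model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), clf)
    model.fit(train_docs, train_labels)
    print(type(clf).__name__, model.predict(["what a great film"]))
```

Because the pipeline owns the vectorizer, `predict` accepts raw strings and applies the same vocabulary that was fitted on the training data.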

Upvotes: 1
