Reputation: 31
I have the following code which estimates the probability that a string of text belongs to a particular class (either positive or negative).
import pickle
from nltk.util import ngrams

# Load the trained classifier from disk
with open("C:/Users/ned/Desktop/gherkin.pickle", "rb") as f:
    classifier = pickle.load(f)

words = ['boring', 'and', 'stupid', 'movie']
feats = dict([(word, True) for word in words])  # unigram features only

classifier.classify(feats)  # most likely label
probs = classifier.prob_classify(feats)
for sample in ('neg', 'pos'):
    print('%s probability: %s' % (sample, probs.prob(sample)))
It yields the following:
neg probability: 0.944
pos probability: 0.055
[Finished in 24.7s]
The pickled classifier which I am loading already makes use of n-grams.
My question is:
How can I edit this code so that n-grams are incorporated into the probability estimate?
Upvotes: 1
Views: 925
Reputation: 795
Depending on the n that was used to train the n-gram classifier, you can generate the corresponding n-grams from your word list and have the classifier score them, which gives you those probabilities.
To generate the new instances, use this example (bigrams and trigrams only):
import nltk

words = nltk.word_tokenize(text)  # or use your existing word list
bigrams = list(nltk.bigrams(words))    # adjacent word pairs
trigrams = list(nltk.trigrams(words))  # adjacent word triples
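From there, the same call from the question applies: build a feature dict out of the generated n-grams and pass it to prob_classify. A minimal sketch, assuming the pickled classifier from the question was trained on tuple features of the same order:
import pickle
import nltk

# Load the classifier from the question
with open("C:/Users/ned/Desktop/gherkin.pickle", "rb") as f:
    classifier = pickle.load(f)

words = ['boring', 'and', 'stupid', 'movie']

# One feature dict keyed by the bigram and trigram tuples
feats = dict((gram, True) for gram in
             list(nltk.bigrams(words)) + list(nltk.trigrams(words)))

probs = classifier.prob_classify(feats)
for label in ('neg', 'pos'):
    print('%s probability: %s' % (label, probs.prob(label)))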
Upvotes: 0
Reputation: 4824
Add the ngrams to your feature dict...
import pickle
from nltk.util import ngrams

with open("C:/Users/ned/Desktop/gherkin.pickle", "rb") as fin:
    classifier = pickle.load(fin)

words = ['boring', 'and', 'stupid', 'movie']

# Mix unigrams with bigram and trigram tuples in a single feature list
ngram_list = words + list(ngrams(words, 2)) + list(ngrams(words, 3))
feats = dict([(word, True) for word in ngram_list])

dist = classifier.prob_classify(feats)
for sample in dist.samples():
    print("%s probability: %f" % (sample, dist.prob(sample)))
Example output...
$ python movie-classifier-example.py
neg probability: 0.999138
pos probability: 0.000862
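Note that ngrams() yields tuples, so feats ends up keyed by a mix of plain strings (the unigrams) and tuples (the bigrams and trigrams). This only works if the classifier was trained on features of the same shape. A quick check of what the tuples look like:
from nltk.util import ngrams

words = ['boring', 'and', 'stupid', 'movie']
print(list(ngrams(words, 2)))
# [('boring', 'and'), ('and', 'stupid'), ('stupid', 'movie')]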
Upvotes: 2