Sherlocked

Reputation: 97

How to append values to a generator while using bigrams in the ConditionalFreqDist method in Python?

Context: I'm using NLTK to generate bigram probabilities. I have generated bigrams from a corpus; 'wordPairsBigram' refers to those corpus bigrams. I also have the sentence "The company chairman said he will increase the profit next year"; 'wordPairSentence' refers to the bigrams in that sentence.

The problem: I need to generate bigram probabilities. To do that, I need the conditional frequency distribution of the sample sentence, which I will then pass to the ConditionalProbDist function. The following code calculates the conditional frequency of those sentence bigrams that also occur in the corpus.

fdListSentence1 = ConditionalFreqDist(wordBigram for wordBigram in wordPairsBigram if wordBigram in wordPairSentence1 )
fdListSentence1.tabulate()  # tabulate() prints the table itself and returns None

output:
          company       he     said     will     year
     The        8        0        0        0        0
chairman        0        0        7        0        0
      he        0        0        0        2        0
    next        0        0        0        0        5
    said        0       21        0        0        0

The issue: the code works for every bigram that occurs in both the corpus and the sample sentence. However, a few bigrams occur in the sample sentence but not in the corpus, and they are left out when the frequency distribution is calculated.

What I want: the frequency distribution for all the bigrams in the sentence. If a bigram in the sentence does not occur among the corpus bigrams, I want it to appear with a value of 0 when tabulating.

Any help is appreciated. I don't know how to express this in the code.

Upvotes: 0

Views: 792

Answers (1)

alvas

Reputation: 122112

What you're trying to do is to smooth the distribution. There are various ways of smoothing; see https://en.wikipedia.org/wiki/Smoothing.

Here's one simple way to do it, a crude stand-in for additive smoothing that gives every unseen bigram a small fixed probability:

from nltk.corpus import brown
from nltk.util import bigrams
from nltk.probability import ConditionalFreqDist
from itertools import chain

train = brown.sents()[:100]
test = brown.sents()[101:110]

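# Count how often each second word follows each first word in the training data.
cfd = ConditionalFreqDist()
train_bigrams = list(chain.from_iterable(bigrams(sent) for sent in train))
for bg in train_bigrams:
    cfd[bg[0]][bg[1]] += 1  # FreqDist.inc() was removed in NLTK 3

# Or, if you prefer a one-liner:
cfd = ConditionalFreqDist(chain.from_iterable(bigrams(sent) for sent in train))

# Look up each test bigram; fall back to a small constant if it was never seen.
for bg in chain.from_iterable(bigrams(sent) for sent in test):
    prob = cfd[bg[0]].freq(bg[1])
    prob = 0.0001 if not prob else prob
    print(bg, prob)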

[out]:

('said', 'it') 0.125
('it', 'would') 0.0001
('would', 'force') 0.111111111111
('force', 'banks') 0.0001
('banks', 'to') 0.0001
('to', 'violate') 0.0001
('violate', 'their') 0.0001
('their', 'contractual') 0.0001
('contractual', 'obligations') 0.0001
('obligations', 'with') 0.0001
('with', 'depositors') 0.0001
('depositors', 'and') 0.0001
('and', 'undermine') 0.0001
('undermine', 'the') 0.0001
('the', 'confidence') 0.0001
('confidence', 'of') 0.0001
('of', 'bank') 0.0001
('bank', 'customers') 0.0001
('customers', '.') 0.0001
('``', 'If') 0.0001
('If', 'you') 0.0001
('you', 'destroy') 0.0001
('destroy', 'confidence') 0.0001
('confidence', 'in') 0.0001
('in', 'banks') 0.0001
('banks', ',') 0.0001
(',', 'you') 0.0001
('you', 'do') 0.0001
('do', 'something') 0.0001
('something', 'to') 0.0001
('to', 'the') 0.0727272727273
('the', 'economy') 0.0001
('economy', "''") 0.0001
("''", ',') 0.205882352941
(',', 'he') 0.0001
('he', 'said') 0.0001
('said', '.') 0.166666666667
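
If you want the zero counts themselves, and a probability that still sums to 1 for each first word, a plain-Python sketch of additive (add-k) smoothing might look like the following. This is an illustration only, not NLTK's API: the toy corpus, the local `bigrams` helper and the `additive_prob` function are all hypothetical names made up for the example, and the vocabulary is assumed to be every word seen in the corpus or the sentence.

```python
from collections import Counter, defaultdict


def bigrams(tokens):
    """Return the list of adjacent token pairs (stand-in for nltk.util.bigrams)."""
    return list(zip(tokens, tokens[1:]))


# Hypothetical toy corpus; in practice these would be your corpus sentences.
corpus = [
    "the company said he will increase the profit".split(),
    "the chairman said he will resign next year".split(),
]
sentence = "the company chairman said he will increase the profit next year".split()

# Count corpus bigrams, grouped by their first word.
counts = defaultdict(Counter)
for sent in corpus:
    for w1, w2 in bigrams(sent):
        counts[w1][w2] += 1

# Assumed vocabulary: every word in the corpus plus every word in the sentence.
vocab = {w for sent in corpus for w in sent} | set(sentence)


def additive_prob(w1, w2, k=1.0):
    """Estimate P(w2 | w1) with add-k smoothing over the vocabulary."""
    total = sum(counts[w1].values())
    return (counts[w1][w2] + k) / (total + k * len(vocab))


# Every sentence bigram is reported; a Counter returns 0 for unseen pairs.
for w1, w2 in bigrams(sentence):
    print((w1, w2), counts[w1][w2], round(additive_prob(w1, w2), 4))
```

Because a Counter returns 0 for a missing key, bigrams that never occur in the corpus show up with a count of 0 instead of being dropped, and the add-k estimate gives them a small non-zero probability without the hand-picked 0.0001 constant.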

Upvotes: 1
