Repzz

Reputation: 53

Bigram and trigram probability python

I really need help understanding how to estimate bigram probabilities. So far I have counted the bigrams in a corpus and then tried to turn the counts into log probabilities:

import math
import nltk
bigram_p = {}

for sentence in corpus:
    tokens = sentence.split()
    # Add a start symbol so the first word also counts as a bigram
    tokens = [START_SYMBOL] + tokens
    bigrams = tuple(nltk.bigrams(tokens))
    for bigram in bigrams:
        if bigram not in bigram_p:
            bigram_p[bigram] = 1
        else:
            bigram_p[bigram] += 1

        for bigram in bigram_p:
            if bigram[0] == '*':  
                bigram_p[bigram] = math.log(bigram_p[bigram]/unigram_p[('STOP',)],2)
            else:
                bigram_p[bigram] = math.log(bigram_p[bigram]/unigram_p[(word[0],)],2)

But I get an error (KeyError, math domain error) and I can't understand why. Please explain what my mistake is and how to fix it.

Upvotes: 2

Views: 5296

Answers (1)

memoselyk

Reputation: 4118

I assume you are getting that error on one of the math.log lines. It means you are passing an argument for which the logarithm is undefined, e.g.

import math

# Input is zero
math.log(0)  # ValueError: math domain error

# Input is negative
math.log(-1)  # ValueError: math domain error

One of your expressions, bigram_p[bigram]/unigram_p[('STOP',)] or bigram_p[bigram]/unigram_p[(word[0],)], is producing a zero or negative input.
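To find out which bigram produces the offending value, one option is to wrap the call in a small guard. This is only a sketch: `checked_log2` and its `label` parameter are hypothetical names, not part of the question's code or of any library.

```python
import math

def checked_log2(value, label):
    """Return log base 2 of value, or raise an error naming the offending item."""
    if value <= 0:
        # math.log would fail here anyway; this message says which item caused it
        raise ValueError("cannot take the log of %r for %r" % (value, label))
    return math.log(value, 2)
```

Calling it as checked_log2(ratio, bigram) inside the loop would report the bigram whose ratio is zero or negative instead of a bare domain error.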

Note that the division operator (/) in Python 2.7 performs integer division when both operands are integers, so the result is truncated to an integer:

1 / 2    # => 0, because 1 and 2 are integers
1. / 2   # => 0.5, because 1. is a float
1.0 / 2  # => 0.5, because 1.0 is a float 
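A minimal sketch of how that truncation leads to the domain error, assuming a bigram count of 1 and a unigram count of 2 (made-up numbers). It uses the floor-division operator `//`, which on any Python version gives the result that `/` gives for two integers in Python 2.7:

```python
import math

count_bigram, count_unigram = 1, 2  # illustrative counts, not from the question

# Floor division reproduces Python 2.7's `/` on two integers
truncated = count_bigram // count_unigram

try:
    math.log(truncated, 2)
    error_message = None
except ValueError as exc:
    # The log of zero is undefined, so math.log raises ValueError
    error_message = str(exc)
```

Any bigram whose count is smaller than the count of its first word would truncate to zero this way.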

If you want more intuitive behavior from the division operator, add this to the top of your file:

from __future__ import division

Here are the docs for that import if you want to understand more about it.

EDIT:

If you can't or don't want to use the import trick, you can convert a number to float either by multiplying it by a float (n * 1.0) or with the built-in function float(n).
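Putting the float conversion together with the counting step, the computation might look like the sketch below. It is not the asker's exact setup: `bigram_log_probs` is a hypothetical helper, the start symbol is assumed to be '*', plain `zip` stands in for nltk.bigrams so the block has no dependencies, and the start symbol is counted as a unigram once per sentence instead of dividing by a separate STOP count.

```python
import math
from collections import Counter

START_SYMBOL = '*'  # assumed start marker, matching the '*' check in the question

def bigram_log_probs(corpus):
    """Map each bigram (w1, w2) to log2(count(w1, w2) / count(w1))."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for sentence in corpus:
        tokens = [START_SYMBOL] + sentence.split()
        unigram_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))
    # Convert to log probabilities only after all counting is finished,
    # and write to a new dict so the counts are never overwritten.
    return {
        bigram: math.log(float(count) / unigram_counts[bigram[0]], 2)
        for bigram, count in bigram_counts.items()
    }
```

Every ratio here is a positive count divided by a count at least as large, so with float division it lies in (0, 1] and the logarithm is always defined.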

Upvotes: 2
