Reputation: 53
I really need help understanding how to estimate probabilities. I calculated the bigram counts in a corpus and then tried to turn them into log probabilities:
import math
import nltk

bigram_p = {}
for sentence in corpus:
    tokens = sentence.split()
    tokens = [START_SYMBOL] + tokens  # Add a start symbol
    # so the first word counts as a bigram
    bigrams = tuple(nltk.bigrams(tokens))
    for bigram in bigrams:
        if bigram not in bigram_p:
            bigram_p[bigram] = 1
        else:
            bigram_p[bigram] += 1

for bigram in bigram_p:
    if bigram[0] == '*':
        bigram_p[bigram] = math.log(bigram_p[bigram]/unigram_p[('STOP',)],2)
    else:
        bigram_p[bigram] = math.log(bigram_p[bigram]/unigram_p[(word[0],)],2)
but I get a KeyError - math domain error - and I can't understand why. Please explain my error and how to fix it.
Upvotes: 2
Views: 5296
Reputation: 4118
I assume you are getting that error on one of the math.log lines. The error means that you are passing an argument for which the logarithm is not defined, e.g.
import math
# Input is zero
math.log(0) # ValueError: math domain error
# Input is negative
math.log(-1) # ValueError: math domain error
One of your expressions, bigram_p[bigram]/unigram_p[('STOP',)] or bigram_p[bigram]/unigram_p[(word[0],)], is producing a zero or negative input.
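To find out which bigram triggers the error, one option is to check the value before taking the logarithm. The helper below is only a sketch: log2_or_report is a hypothetical name, not something provided by math or nltk. It fails the same way math.log does, but names the offending key.

```python
import math

def log2_or_report(value, label):
    """Return log2(value); raise ValueError naming `label` if value <= 0.

    Hypothetical debugging helper, not part of math or nltk.
    """
    if value <= 0:
        raise ValueError("cannot take log of %r for %r" % (value, label))
    return math.log(value, 2)
```

You could then call log2_or_report(bigram_p[bigram]/unigram_p[('STOP',)], bigram) in place of math.log(...) while debugging.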
Note that the division operator (/) in Python 2.7 performs integer division when both operands are integers, so the result is rounded down to an integer:
1 / 2 # => 0, because 1 and 2 are integers
1. / 2 # => 0.5, because 1. is a float
1.0 / 2 # => 0.5, because 1.0 is a float
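Here is a small sketch of how that leads to the error in your case. The counts are made up, and // is used so the snippet behaves the same on Python 2 and Python 3 (it is what / does to two integers in Python 2.7):

```python
import math

count, total = 1, 2      # made-up counts for illustration

ratio = count // total   # floor division: what `/` does to two ints in Python 2.7
print(ratio)             # 0

failed = False
try:
    math.log(ratio, 2)   # log of zero is undefined
except ValueError as err:
    failed = True
    print(err)           # the "math domain error" style message
```

Any bigram whose count is smaller than the count it is divided by ends up as 0 this way, which is nearly all of them.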
If you want more intuitive behavior from the division operator, add this at the top of your file:
from __future__ import division
Here are the docs for that import if you want to understand more about it.
EDIT:
If you can't or don't want to use the import trick, you can convert a number to float either by multiplying it by a float (n * 1.0) or with the built-in function float(n).
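Applied to your second loop, the float(n) approach could look roughly like this. The counts are toy values invented for illustration, and I am assuming that word[0] in your code was meant to be bigram[0]; results go into a separate dictionary (bigram_logp, a name I made up) so the counts are not overwritten.

```python
import math

# Toy counts, invented for illustration only.
unigram_counts = {('the',): 4, ('STOP',): 2}
bigram_counts = {('*', 'the'): 2, ('the', 'cat'): 1}

bigram_logp = {}
for bigram, count in bigram_counts.items():
    # As in the question, bigrams that start with '*' are divided by the STOP count.
    # Assumption: word[0] in the question was meant to be bigram[0].
    context = ('STOP',) if bigram[0] == '*' else (bigram[0],)
    # float() forces true division, so the ratio is never truncated to 0.
    bigram_logp[bigram] = math.log(float(count) / unigram_counts[context], 2)
```

With these numbers the ratios are 2/2 and 1/4, so both logarithms are defined and neither is taken of zero.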
Upvotes: 2