Reputation: 35
I have a co-occurrence matrix of words in a text (two words x and y are considered co-occurring if they both occur within a context window of w words). I want to calculate the pointwise mutual information (PMI) for two words x and y, using the common formula:
PMI(x, y) = log2(P(x, y) / (P(x) * P(y)))
There may be cases where P(x)*P(y) = 0, e.g.:
| | X | Not X |
|---|---|---|
| Y | 30 | 0 |
| Not Y | 0 | 1500 |
or
| | X | Not X |
|---|---|---|
| Y | 30 | 0 |
| Not Y | 1000 | 100 |
How do I handle such cases so that I avoid a math error (division by zero) in my Python script without distorting the data?
I tried to find information on websites explaining PMI, but they don't mention this special case. Either it does not happen often (which I cannot believe, since there must be something like "perfect" PMI), or the solution is so trivial that everyone knows it but no one mentions it. What can be done to handle the problem?
My ideas so far:
I would be glad for any hint about procedures that are usually accepted when working with PMI.
EDIT
It turns out that the problem was due to a misunderstanding of PMI on my side. In the above examples, P(x)*P(y) is not 0, because each word occurs at least 30 times. What matters is not the probability of x or y occurring without the other word, but the probability of each word occurring at all, which includes the times it occurs together with the other word.
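To make the corrected reading concrete, here is a minimal sketch that computes PMI for the two tables above from the marginal counts. The function name `pmi_from_table` and its parameter names are my own, and it assumes each table cell is a count of context windows and that x and y co-occur at least once:

```python
import math

def pmi_from_table(both, y_only, x_only, neither):
    """PMI from the four cells of a 2x2 contingency table (hypothetical helper)."""
    total = both + y_only + x_only + neither
    p_xy = both / total
    # Marginals include the windows where the two words occur together.
    p_x = (both + x_only) / total
    p_y = (both + y_only) / total
    return math.log2(p_xy / (p_x * p_y))

# First table: x and y only ever occur together.
first = pmi_from_table(both=30, y_only=0, x_only=0, neither=1500)
# Second table: x also occurs 1000 times without y.
second = pmi_from_table(both=30, y_only=0, x_only=1000, neither=100)
print(first, second)
```

Neither call divides by zero, since P(x) and P(y) are both positive in each table. The first table gives a high PMI (about 5.67), the second one a PMI close to 0 (about 0.13).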
If x and/or y never occurs in the entire corpus, catching this case manually (e.g., as @Mustafa suggests in their answer) might help.
Upvotes: 2
Views: 155
Reputation: 39
To deal with division by zero in PMI(x, y) = log2(P(x, y) / (P(x) * P(y))), you can add the following check where you compute PMI:

    import math

    if prob_word1 == 0 or prob_word2 == 0 or prob_pair == 0:
        # PMI is undefined (or log2(0)) here, so use negative infinity as a sentinel
        pointwise_mutual_information = float('-inf')
    else:
        pointwise_mutual_information = math.log2(prob_pair / (prob_word1 * prob_word2))
Another option is to calculate PPMI, which is ppmi = max(pointwise_mutual_information, 0).
PPMI stands for Positive Pointwise Mutual Information (Jurafsky and Martin, Speech and Language Processing (3rd ed. draft), Chapter 6). It is useful when you only need the positive values of PMI.
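Putting both ideas together, a rough sketch working on raw counts could look like the following. The function names `pmi` and `ppmi` and the count-based signature are my own choices for illustration, not something taken from the book; `total` is assumed to be the number of context windows in the corpus:

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """PMI from raw counts; returns -inf when any count is zero."""
    if total <= 0:
        raise ValueError("total must be positive")
    if count_xy == 0 or count_x == 0 or count_y == 0:
        return float("-inf")
    # Same as log2(P(x, y) / (P(x) * P(y))) with each P = count / total.
    return math.log2((count_xy * total) / (count_x * count_y))

def ppmi(count_xy, count_x, count_y, total):
    """Positive PMI: negative values (including -inf) are clipped to 0."""
    return max(pmi(count_xy, count_x, count_y, total), 0.0)

print(ppmi(30, 30, 30, 1530))  # words that always occur together
print(ppmi(0, 5, 5, 100))      # never co-occur: clipped to 0.0
```

A side effect of the clipping is that the `-inf` sentinel never reaches later calculations, since `max(float('-inf'), 0.0)` is `0.0`.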
Upvotes: 0