AlinaOs

Reputation: 35

How to deal with word counts of zero when calculating Pointwise Mutual Information (PMI) for word cooccurrences in Natural Language Processing

I have a co-occurrence matrix of words in a text (two words x and y are considered co-occurring, if they both occur in a context window of w words). I want to calculate the Pointwise Mutual Information for two words x and y which I do using the common formula

PMI(x, y) = log2(P(x, y) / (P(x) * P(y))).

There may be cases, where P(x)*P(y) = 0, e.g.:

        X    Not X
Y      30        0
Not Y   0     1500

or

        X    Not X
Y      30        0
Not Y 1000     100

How do I handle such cases in order to avoid a math error (division by zero) in my Python script without distorting the data?

I tried to find information on websites explaining PMI, but they don't mention this special case. Either it doesn't happen often (which I find hard to believe, since there must be something like a "perfect" PMI) or the solution is so trivial that everyone knows it but no one mentions it. What can be done to handle the problem?

My ideas so far:

  1. Define what should happen in such a case and catch it with an if-clause, then manually assign the desired value. But this seems inexact to me and depends on many non-binary factors. E.g., in table one there is a total correlation, while in table two the correlation is rather coincidental, since nearly the whole corpus consists of x, so y is bound to co-occur with it.
  2. Use some kind of additive smoothing as suggested in a comment on this thread, i.e. adding a small positive value to all counts involved in the calculation. But how large should this value be so as not to skew the frequency distribution, even for small corpora - 1, 0.1, 0.001, or something else entirely?
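A minimal sketch of idea 2, assuming add-k smoothing over the full co-occurrence matrix; the function name, `vocab_size` parameter, and the default k are illustrative assumptions, not an established recipe:

```python
import math

def smoothed_pmi(count_xy, count_x, count_y, total, vocab_size, k=0.1):
    """PMI with add-k smoothing on the raw co-occurrence counts.

    Adding k to every cell of the vocab_size x vocab_size matrix
    grows the grand total by k * vocab_size**2 and each marginal
    by k * vocab_size, so the probabilities stay consistent."""
    smoothed_total = total + k * vocab_size ** 2
    p_xy = (count_xy + k) / smoothed_total
    p_x = (count_x + k * vocab_size) / smoothed_total
    p_y = (count_y + k * vocab_size) / smoothed_total
    return math.log2(p_xy / (p_x * p_y))
```

With k > 0 no probability is ever exactly zero, so the logarithm is always defined; the larger k is, the more every PMI value is pulled toward 0.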

I would be glad about any hint for procedures usually accepted when working with PMI.


EDIT

It turns out that the problem I had was due to a misunderstanding of PMI on my part. In the above examples, P(x)*P(y) is not 0, because each word occurs at least 30 times. What matters is not the probability of x or y occurring without the other word, but the probability of each word occurring at all, which includes the times it occurs together with the other word.

In case x and/or y never occur in the entire corpus, manually catching this case (e.g., as @Mustafa suggests in their answer) might help.
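To illustrate, here are the marginals computed from table one above; the counts come straight from that table:

```python
import math

# Contingency table one: count x (and y) across the whole row (column),
# i.e. including the cell where both words co-occur.
count_xy = 30                 # x and y together
count_x = 30 + 0              # all occurrences of x, with and without y
count_y = 30 + 0              # all occurrences of y, with and without x
total = 30 + 0 + 0 + 1500     # sum over all four cells

p_x = count_x / total         # 30/1530 -- not zero
p_y = count_y / total
p_xy = count_xy / total

pmi = math.log2(p_xy / (p_x * p_y))   # log2(1530/30) = log2(51)
```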

Upvotes: 2

Views: 155

Answers (1)

Mustafa

Reputation: 39

In order to deal with division by zero in PMI(x, y) = log2(P(x, y) / (P(x) * P(y))), you can add the following check before the PMI calculation:

import math

# If any probability is zero, the PMI is undefined; assign -inf instead.
if prob_word1 == 0 or prob_word2 == 0 or prob_pair == 0:
    pointwise_mutual_information = float('-inf')
else:
    pointwise_mutual_information = math.log2(prob_pair / (prob_word1 * prob_word2))

Another thing you can do is to calculate PPMI, i.e. ppmi = max(pointwise_mutual_information, 0).

PPMI stands for Positive Pointwise Mutual Information (Jurafsky and Martin, Speech and Language Processing (3rd ed. draft), Chapter 6). This is for the case that you only need positive values of PMI.
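A minimal sketch combining the zero check with the PPMI clamp; the function name is an assumption for illustration:

```python
import math

def ppmi(prob_pair, prob_word1, prob_word2):
    # A zero probability would give PMI of -inf, which the PPMI
    # clamp maps to 0 anyway, so it can be handled directly.
    if prob_word1 == 0 or prob_word2 == 0 or prob_pair == 0:
        return 0.0
    return max(math.log2(prob_pair / (prob_word1 * prob_word2)), 0.0)
```

For example, `ppmi(0.0, 0.02, 0.02)` returns 0.0 instead of negative infinity.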

Upvotes: 0

Related Questions