Reputation: 151
Let's say we have a query that consists of the following 4 strings: w1, w2, w3 and w4.
The pointwise mutual information (PMI) between two strings is defined as: PMI(w_i,w_j) = log(p(w_i,w_j)/(p(w_i)*p(w_j)))
To find the average PMI, one would naturally calculate the PMI for every pair and average the results. But what do we do when a pair under consideration has no documents in common?
Ex: Let's say w1 and w2 have no common documents, which in turn means that p(w1,w2) = 0 and the PMI is log(0), i.e. negative infinity. How do we take an average then? Do we neglect the pairs whose PMI is infinite? If we do neglect such pairs, then what should we do when no two strings in the query have any common documents?
Upvotes: 1
Views: 329
Reputation: 7394
Standard answer: when estimating probabilities, smooth.
Thus, assuming p(w_1) is the probability that a document contains w_1, if the query w_1 returns n_1 documents out of N total, you switch your estimate for p(w_1) from:
n_1 / N (the unsmoothed estimate, otherwise known as the maximum likelihood estimate)
to:
(n_1 + 1) / (N + 2) (this is actually the posterior mean of the parameter, assuming a uniform prior).
This means you never get zeros anywhere. Similarly for empirical estimates of joint probability p(w_1, w_2), use:
(count(w_1 and w_2) + 1) / (N + 2)
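As a rough illustration of the smoothing above applied to the average-PMI question, here is a minimal Python sketch. The function names (`smoothed_prob`, `smoothed_pmi`, `average_pmi`) and the representation of each string as a set of matching document IDs are assumptions made for this example, not part of any particular library. It assumes the query has at least two strings.

```python
import math
from itertools import combinations


def smoothed_prob(count, total):
    """Add-one smoothed estimate: (count + 1) / (total + 2)."""
    return (count + 1) / (total + 2)


def smoothed_pmi(n_i, n_j, n_ij, total):
    """PMI computed from smoothed marginal and joint estimates."""
    p_i = smoothed_prob(n_i, total)
    p_j = smoothed_prob(n_j, total)
    p_ij = smoothed_prob(n_ij, total)
    return math.log(p_ij / (p_i * p_j))


def average_pmi(doc_sets, total):
    """Average the smoothed PMI over all unordered pairs of query strings.

    doc_sets maps each string to the set of document IDs containing it
    (a hypothetical input format chosen for this sketch).
    """
    scores = [
        smoothed_pmi(
            len(doc_sets[a]),
            len(doc_sets[b]),
            len(doc_sets[a] & doc_sets[b]),  # documents containing both
            total,
        )
        for a, b in combinations(sorted(doc_sets), 2)
    ]
    return sum(scores) / len(scores)


# Toy data: w1 and w2 share no documents, yet the average stays finite.
docs = {"w1": {1, 2, 3}, "w2": {4, 5}, "w3": {1, 4}, "w4": {2, 5, 6}}
avg = average_pmi(docs, total=10)
```

Because every count is incremented before dividing, no estimate is ever zero, so pairs with no common documents contribute a finite value instead of having to be dropped.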
Upvotes: 1