Reputation: 252
I've implemented the mutual information formula in Python using pandas and numpy:

    import numpy as np
    import pandas as pd

    def mutual_info(p):
        # p is the joint distribution P(x, y) as a DataFrame:
        # rows are values of y, columns are values of x
        p_y = p.sum(axis=1)  # P(y): sum out x across each row
        p_x = p.sum(axis=0)  # P(x): sum out y down each column
        I = 0.0
        for i_y in p.index:
            for i_x in p.columns:
                I += p.loc[i_y, i_x] * np.log2(p.loc[i_y, i_x] / (p_y[i_y] * p_x[i_x]))
        return I
However, if a cell of p has zero probability, then np.log2(p.loc[i_y, i_x] / (p_y[i_y] * p_x[i_x])) is negative infinity, the term becomes 0 * -inf, and the whole sum evaluates to NaN.
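For example, the arithmetic in isolation (np.log2 also emits a divide-by-zero RuntimeWarning here):

    >>> import numpy as np
    >>> 0.0 * np.log2(0.0)  # log2(0) is -inf; 0 * -inf is nan in IEEE arithmetic
    nan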
What is the right way to work around that?
Upvotes: 4
Views: 2746
Reputation: 76317
For various theoretical and practical reasons (e.g., see Competitive Distribution Estimation: Why is Good-Turing Good), you might consider never using a zero probability with the log loss measure.
So, say, if you have a probability vector p, then, for some small scalar α > 0, you would use α u + (1 - α) p, where u is the uniform vector (each entry 1/n for a vector of length n). Unfortunately, there are no general guidelines for choosing α; you'll have to assess its effect further down the calculation.
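A minimal sketch of that mixing step, applied to a joint-distribution DataFrame like the one in the question (the helper name smooth and the value alpha=0.01 are illustrative assumptions, not recommendations):

    import numpy as np
    import pandas as pd

    def smooth(p, alpha=0.01):
        # alpha * uniform + (1 - alpha) * p, elementwise: every resulting
        # cell is at least alpha / p.size, so no log of zero is ever taken
        return alpha / p.size + (1 - alpha) * p

    p = pd.DataFrame([[0.50, 0.00],
                      [0.25, 0.25]], index=['y0', 'y1'], columns=['x0', 'x1'])
    mutual_info(smooth(p))  # finite; mutual_info(p) itself returns nan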
For the Kullback-Leibler distance, you would of course apply this to each of the inputs.
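Along the same lines, a sketch for the KL case (kl_divergence is a hypothetical helper that reuses smooth from above and reports the result in bits):

    def kl_divergence(p, q, alpha=0.01):
        # D(p || q) with both inputs mixed toward uniform, so both the
        # numerator and denominator inside the log stay strictly positive
        p_s = smooth(np.asarray(p, dtype=float), alpha)
        q_s = smooth(np.asarray(q, dtype=float), alpha)
        return float(np.sum(p_s * np.log2(p_s / q_s)))

    kl_divergence([0.5, 0.5, 0.0], [0.2, 0.3, 0.5])  # finite despite the zero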
Upvotes: 3