Reputation: 252
I've implemented the mutual information formula in Python using pandas and numpy:

    import numpy as np
    import pandas as pd

    def mutual_info(p):
        # p is the joint distribution P(x, y) as a DataFrame:
        # rows are values of y, columns are values of x
        p_y = p.sum(axis=1)  # P(y): sum out x across each row
        p_x = p.sum(axis=0)  # P(x): sum out y down each column
        I = 0.0
        for i_y in p.index:
            for i_x in p.columns:
                I += p.loc[i_y, i_x] * np.log2(p.loc[i_y, i_x] / (p_y[i_y] * p_x[i_x]))
        return I
However, if a cell of p has zero probability, then np.log2(p.loc[i_y, i_x] / (p_y[i_y] * p_x[i_x])) is negative infinity, the term becomes 0 * -inf, and the whole sum evaluates to NaN.
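For example, the arithmetic in isolation (np.log2 also emits a divide-by-zero RuntimeWarning here):

    >>> import numpy as np
    >>> 0.0 * np.log2(0.0)  # log2(0) is -inf; 0 * -inf is nan in IEEE arithmetic
    nan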
What is the right way to work around that?
Upvotes: 4
Views: 2746
Reputation: 76317
For various theoretical and practical reasons (e.g., see Competitive Distribution Estimation: Why is Good-Turing Good), you might consider never using a zero probability with the log loss measure.
So, say, if you have a probability vector p, then, for some small scalar α > 0, you would use α u + (1 - α) p, where u is the uniform vector (each entry 1/n for a vector of length n). Unfortunately, there are no general guidelines for choosing α; you'll have to assess its effect further down the calculation.
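A minimal sketch of that mixing step, applied to a joint-distribution DataFrame like the one in the question (the helper name smooth and the value alpha=0.01 are illustrative assumptions, not recommendations):

    import numpy as np
    import pandas as pd

    def smooth(p, alpha=0.01):
        # alpha * uniform + (1 - alpha) * p, elementwise: every resulting
        # cell is at least alpha / p.size, so no log of zero is ever taken
        return alpha / p.size + (1 - alpha) * p

    p = pd.DataFrame([[0.50, 0.00],
                      [0.25, 0.25]], index=['y0', 'y1'], columns=['x0', 'x1'])
    mutual_info(smooth(p))  # finite; mutual_info(p) itself returns nan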
For the Kullback-Leibler distance, you would of course apply this to each of the inputs.
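Along the same lines, a sketch for the KL case (kl_divergence is a hypothetical helper that reuses smooth from above and reports the result in bits):

    def kl_divergence(p, q, alpha=0.01):
        # D(p || q) with both inputs mixed toward uniform, so both the
        # numerator and denominator inside the log stay strictly positive
        p_s = smooth(np.asarray(p, dtype=float), alpha)
        q_s = smooth(np.asarray(q, dtype=float), alpha)
        return float(np.sum(p_s * np.log2(p_s / q_s)))

    kl_divergence([0.5, 0.5, 0.0], [0.2, 0.3, 0.5])  # finite despite the zero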
Upvotes: 3