Reputation: 53
I’m new to Python and I’m trying to compute the normalized mutual information between two different signals. No matter what signals I use, the result I obtain is always 1, which I believe is impossible, because the signals are different and not totally correlated.
I’m using the normalized mutual information function provided by scikit-learn: sklearn.metrics.normalized_mutual_info_score(labels_true, labels_pred).
Here’s the code I’m using:
from numpy.random import randn
from numpy import *
from matplotlib.pyplot import *
from sklearn.metrics.cluster import normalized_mutual_info_score as mi

def fzX(X):
    '''z-score the columns of X'''
    if len(X.shape) > 1:
        # X is a matrix: one variable per column
        meanX = mean(X, 0)
        stdX = std(X, 0)
        stdX[stdX < 1e-9] = 0   # treat near-constant columns as constant
        zX = zeros(X.shape)
        for i in range(X.shape[1]):
            if stdX[i] > 0:
                zX[:, i] = (X[:, i] - meanX[i]) / stdX[i]
            else:
                zX[:, i] = 0
    else:
        # X is a vector: a single variable
        meanX = mean(X)
        stdX = std(X)
        zX = (X - meanX) / stdX
    return zX, meanX, stdX

def fMI(X):
    '''variables in columns;
    returns the mutual information of the normalized data'''
    zX, meanX, stdX = fzX(X)
    n = X.shape[1]
    Mut_Info = zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            Mut_Info[i, j] = mi(zX[:, i], zX[:, j])
            Mut_Info[j, i] = Mut_Info[i, j]
    plot(zX); show()
    return Mut_Info

t = arange(0, 100, 0.1)   # t = 0:0.1:99.9
N = len(t)                # number of samples in t
u = sin(2*pi*t) + (randn(N)*2)**2
y = (cos(2*pi*t - 2))**2 + randn(N)*2

X = zeros((len(u), 2))
X[:, 0] = u
X[:, 1] = y

mut = fMI(X)
print(mut)
plot(X)
show()
Has anyone had a similar problem before? Do you know what I’m doing wrong?
Thank you very much in advance for your dedicated time.
Upvotes: 5
Views: 12035
Reputation: 151007
Your floating point data can't be used this way -- normalized_mutual_info_score is defined over clusters. The function is going to interpret every floating point value as a distinct cluster. And if you look back at the documentation, you'll see that the function throws out information about cluster labels. After all, the labels themselves are arbitrary, so anti-correlated labels have as much mutual information as correlated labels.
Examples
Here are a couple of examples based directly on the documentation:
>>> normalized_mutual_info_score([1, 1, 0, 0], [1, 1, 0, 0])
1.0
>>> normalized_mutual_info_score([1, 1, 0, 0], [0, 0, 1, 1])
1.0
See how the labels are perfectly correlated in the first case, and perfectly anti-correlated in the second? But in both cases, the mutual information is 1.0. The same pattern continues for partially correlated values:
>>> normalized_mutual_info_score([1, 1, 0, 0], [1, 0, 1, 1])
0.34559202994421129
>>> normalized_mutual_info_score([1, 1, 0, 0], [0, 1, 0, 0])
0.34559202994421129
Swapping the labels just in the second sequence has no effect. And again, this time with floating point values:
>>> normalized_mutual_info_score([0.1, 0.1, 0.5, 0.5], [0.1, 0.1, 0.1, 0.5])
0.34559202994421129
>>> normalized_mutual_info_score([0.1, 0.1, 0.5, 0.5], [0.5, 0.5, 0.5, 0.1])
0.34559202994421129
So having seen all that, this shouldn't seem so surprising:
>>> normalized_mutual_info_score([0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8])
1.0
Each floating point value is considered its own label, but the labels themselves are arbitrary. So the function can't tell any difference between the two sequences of labels, and returns 1.0.
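This is exactly what happens with the signals in the question: essentially every sample of u and y is a unique floating point value, so each one becomes its own singleton cluster, and the score is 1.0 no matter how the signals are related. A quick check with the question's own signals:

>>> import numpy as np
>>> t = np.arange(0, 100, 0.1)
>>> u = np.sin(2*np.pi*t) + (np.random.randn(len(t))*2)**2
>>> y = np.cos(2*np.pi*t - 2)**2 + np.random.randn(len(t))*2
>>> normalized_mutual_info_score(u, y)
1.0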
Working with floating point data
If you're starting out with floating point data, and you need to do this calculation, you probably want to assign cluster labels, perhaps by putting points into bins using two different schemes.
For example, in the first scheme, you could put every value p <= 0.5 in cluster 0 and every value p > 0.5 in cluster 1. Then, in the second scheme, you could put every value p <= 0.4 in cluster 0 and every value p > 0.4 in cluster 1. These clusterings would mostly overlap; the points where they did not would cause the mutual information score to go down.
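Here's a minimal sketch of that idea (the data values are made up purely for illustration):

import numpy as np
from sklearn.metrics.cluster import normalized_mutual_info_score

p = np.array([0.1, 0.3, 0.45, 0.55, 0.7, 0.9])

labels_a = (p > 0.5).astype(int)  # first scheme: threshold at 0.5
labels_b = (p > 0.4).astype(int)  # second scheme: threshold at 0.4

# The two clusterings disagree only at p = 0.45,
# so the score drops below 1.0.
print(normalized_mutual_info_score(labels_a, labels_b))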
There are other possible clustering schemes -- I'm not quite sure what your goal is, so I can't give more concrete advice than that.
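That said, if the goal is simply to compare two continuous signals like the u and y in the question, one rough possibility is to discretize each signal before scoring it. This is only a sketch: the equal-width binning, the helper name binned_nmi, and the bin count of 10 are all arbitrary choices here, and the resulting score will depend on them.

import numpy as np
from sklearn.metrics.cluster import normalized_mutual_info_score

def binned_nmi(x, y, bins=10):
    '''Discretize two continuous signals into equal-width bins,
    then score the resulting bin labels.'''
    x_labels = np.digitize(x, np.histogram(x, bins=bins)[1][:-1])
    y_labels = np.digitize(y, np.histogram(y, bins=bins)[1][:-1])
    return normalized_mutual_info_score(x_labels, y_labels)

t = np.arange(0, 100, 0.1)
u = np.sin(2*np.pi*t) + (np.random.randn(len(t))*2)**2
y = np.cos(2*np.pi*t - 2)**2 + np.random.randn(len(t))*2
print(binned_nmi(u, y))  # no longer stuck at 1.0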
Upvotes: 9