Reputation: 1008
I would like to compare two partitions/clusterings (P1 and P2) of a set S, where the partitions have different numbers of clusters. Example:
S = [1, 2, 3, 4, 5, 6]
P1 = [[1, 2], [3,4], [5,6]]
P2 = [ [1,2,3,4], [5, 6]]
From what I have read, mutual information could be an approach, and it is implemented in scikit-learn. The definition does not state that the partitions must have the same number of clusters (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mutual_info_score.html).
However, when I run my code I get an error due to the different sizes.
from sklearn import metrics
P1 = [[1, 2], [3,4], [5,6]]
P2 = [ [1,2,3,4], [5, 6]]
metrics.mutual_info_score(P1,P2)
ValueError                                Traceback (most recent call last)
<ipython-input-183-d5cb8d32ce7d> in <module>()
2 P2 = [ [1,2,3,4], [5, 6]]
3
----> 4 metrics.mutual_info_score(P1,P2)
/home/user/anaconda2/lib/python2.7/site-packages/sklearn/metrics/cluster/supervised.pyc in mutual_info_score(labels_true, labels_pred, contingency)
556 """
557 if contingency is None:
--> 558 labels_true, labels_pred = check_clusterings(labels_true, labels_pred)
559 contingency = contingency_matrix(labels_true, labels_pred)
560 contingency = np.array(contingency, dtype='float')
/home/user/anaconda2/lib/python2.7/site-packages/sklearn/metrics/cluster/supervised.pyc in check_clusterings(labels_true, labels_pred)
34 if labels_true.ndim != 1:
35 raise ValueError(
---> 36 "labels_true must be 1D: shape is %r" % (labels_true.shape,))
37 if labels_pred.ndim != 1:
38 raise ValueError(
ValueError: labels_true must be 1D: shape is (3, 2)
Is there a way to use scikit-learn and mutual information to see how close these partitions are? Alternatively, is there a way to compare them without using mutual information?
Upvotes: 3
Views: 6263
Reputation: 1008
The error comes from the format in which the information is passed to the function. The correct format is a list of labels, one for each element of the set S being partitioned. Each label identifies the cluster the element belongs to, so elements with the same label are in the same cluster. For the example:
from sklearn import metrics

S = [1, 2, 3, 4, 5, 6]
P1 = [[1, 2], [3,4], [5,6]]
P2 = [ [1,2,3,4], [5, 6]]
labs_1 = [ 1, 1, 2, 2, 3, 3]
labs_2 = [1, 1, 1, 1, 2, 2]
metrics.mutual_info_score(labs_1, labs_2)
The answer is then:
0.636514168294813
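If you start from the list-of-lists format, a small helper can build the label lists automatically. The sketch below assumes the clusters are disjoint and cover all of S, and partition_to_labels is just an illustrative name, not part of scikit-learn:

from sklearn import metrics

def partition_to_labels(S, partition):
    # Map each element of S to the index of the cluster that contains it.
    # Assumes the clusters are disjoint and together cover all of S.
    labels = []
    for element in S:
        for cluster_index, cluster in enumerate(partition):
            if element in cluster:
                labels.append(cluster_index)
                break
    return labels

labs_1 = partition_to_labels(S, P1)   # [0, 0, 1, 1, 2, 2]
labs_2 = partition_to_labels(S, P2)   # [0, 0, 0, 0, 1, 1]
metrics.mutual_info_score(labs_1, labs_2)   # 0.636514168294813, as above

The actual label values do not matter, only which elements share a label, so 0-based indices give the same score as the 1-based labels above.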
If we want to compute the mutual information directly from the partitions in their original format, we can use the following code:
from __future__ import division  # needed under Python 2 so that / is true division
from sklearn import metrics
import numpy as np

S = [1, 2, 3, 4, 5, 6]
P1 = [[1, 2], [3, 4], [5, 6]]
P2 = [[1, 2, 3, 4], [5, 6]]

set_partition1 = [set(p) for p in P1]
set_partition2 = [set(p) for p in P2]

def prob_dist(clustering, cluster, N):
    '''Probability that a random element of S falls into the given cluster.

    clustering (list): a partition, as a list of sets
    cluster (int): index of a cluster in the partition
    N (int): total number of elements
    '''
    return len(clustering[cluster]) / N

def prob_joint_dist(clustering1, clustering2, cluster1, cluster2, N):
    '''Joint probability that an element falls into cluster1 of the first
    partition and cluster2 of the second partition.

    clustering1 (list): first partition
    clustering2 (list): second partition
    cluster1 (int): index of a cluster of the first partition
    cluster2 (int): index of a cluster of the second partition
    N (int): total number of elements
    '''
    c1 = clustering1[cluster1]
    c2 = clustering2[cluster2]
    n_ij = len(set(c1).intersection(c2))  # size of the overlap between the two clusters
    return n_ij / N

def mutual_info(clustering1, clustering2, N):
    '''Mutual information between two partitions of the same set.

    clustering1 (list): first partition
    clustering2 (list): second partition
    N (int): total number of elements
    Note: under Python 2, division must be imported from __future__ for this to work.
    '''
    n_clas = len(clustering1)
    n_com = len(clustering2)
    mutual_info = 0
    for i in range(n_clas):
        for j in range(n_com):
            p_i = prob_dist(clustering1, i, N)
            p_j = prob_dist(clustering2, j, N)
            R_ij = prob_joint_dist(clustering1, clustering2, i, j, N)
            if R_ij > 0:  # clusters that do not overlap contribute nothing
                mutual_info += R_ij * np.log(R_ij / (p_i * p_j))
    return mutual_info

mutual_info(set_partition1, set_partition2, len(S))
which gives the same answer:
0.63651416829481278
Note that we are using the natural logarithm, not log2, but the code can easily be adapted.
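For instance, a minimal sketch of a base-2 version, reusing the functions defined above:

# Mutual information in bits: divide the natural-log result by log(2) ...
mi_bits = mutual_info(set_partition1, set_partition2, len(S)) / np.log(2)

# ... or, equivalently, replace np.log with np.log2 inside the loop:
#     mutual_info += R_ij * np.log2(R_ij / (p_i * p_j))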
Upvotes: 1