RM-

Reputation: 1008

Python scikit-learn implementation of mutual information not working for partitions of different size

I would like to compare two partitions/clusterings (P1 and P2), of different sizes, of a set S. Example:

S = [1, 2, 3, 4, 5, 6]
P1 = [[1, 2], [3, 4], [5, 6]]
P2 = [[1, 2, 3, 4], [5, 6]]

From what I have read, mutual information could be an approach, and it is implemented in scikit-learn. The definition does not state that the partitions must be the same size (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mutual_info_score.html).

However, when I run my code I get an error due to the different sizes.

from sklearn import metrics
P1 = [[1, 2], [3, 4], [5, 6]]
P2 = [[1, 2, 3, 4], [5, 6]]
metrics.mutual_info_score(P1,P2)


ValueErrorTraceback (most recent call last)
<ipython-input-183-d5cb8d32ce7d> in <module>()
      2 P2 = [ [1,2,3,4], [5, 6]]
      3 
----> 4 metrics.mutual_info_score(P1,P2)

/home/user/anaconda2/lib/python2.7/site-packages/sklearn/metrics/cluster/supervised.pyc in mutual_info_score(labels_true, labels_pred, contingency)
    556     """
    557     if contingency is None:
--> 558         labels_true, labels_pred = check_clusterings(labels_true, labels_pred)
    559         contingency = contingency_matrix(labels_true, labels_pred)
    560     contingency = np.array(contingency, dtype='float')

/home/user/anaconda2/lib/python2.7/site-packages/sklearn/metrics/cluster/supervised.pyc in check_clusterings(labels_true, labels_pred)
     34     if labels_true.ndim != 1:
     35         raise ValueError(
---> 36             "labels_true must be 1D: shape is %r" % (labels_true.shape,))
     37     if labels_pred.ndim != 1:
     38         raise ValueError(

ValueError: labels_true must be 1D: shape is (3, 2)

Is there a way to use scikit-learn and mutual information to measure how close these partitions are? If not, is there a way that does not use mutual information?

Upvotes: 3

Views: 6263

Answers (1)

RM-

Reputation: 1008

The error comes from the form in which the information is passed to the function. The correct form is a list with one label per element of the set S being partitioned. Each label identifies the cluster the element belongs to, so elements with the same label are in the same cluster. To solve the example:

from sklearn import metrics

S = [1, 2, 3, 4, 5, 6]
P1 = [[1, 2], [3, 4], [5, 6]]
P2 = [[1, 2, 3, 4], [5, 6]]
labs_1 = [1, 1, 2, 2, 3, 3]  # cluster label of each element of S under P1
labs_2 = [1, 1, 1, 1, 2, 2]  # cluster label of each element of S under P2
metrics.mutual_info_score(labs_1, labs_2)

The answer is then:

0.636514168294813
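If you do not want to write the labels by hand, the conversion from the partition format to labels can be automated. A minimal sketch (the helper name partition_to_labels is my own, not part of scikit-learn; it uses 0-based labels, which makes no difference to the score):

```python
from sklearn import metrics

def partition_to_labels(S, partition):
    """Hypothetical helper: map each element of S to the index of
    the block of `partition` that contains it."""
    return [next(i for i, block in enumerate(partition) if x in block)
            for x in S]

S = [1, 2, 3, 4, 5, 6]
P1 = [[1, 2], [3, 4], [5, 6]]
P2 = [[1, 2, 3, 4], [5, 6]]

labs_1 = partition_to_labels(S, P1)  # [0, 0, 1, 1, 2, 2]
labs_2 = partition_to_labels(S, P2)  # [0, 0, 0, 0, 1, 1]
print(metrics.mutual_info_score(labs_1, labs_2))
```

This assumes the blocks are disjoint and cover S, i.e. that the input really is a partition.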

If we want to compute the mutual information directly from the partition format given originally, we can use the following code:

from __future__ import division  # must come first in Python 2; makes / true division
import numpy as np

S = [1, 2, 3, 4, 5, 6]
P1 = [[1, 2], [3, 4], [5, 6]]
P2 = [[1, 2, 3, 4], [5, 6]]
set_partition1 = [set(p) for p in P1]
set_partition2 = [set(p) for p in P2]

def prob_dist(clustering, cluster, N):
    '''Probability that a random element of S falls in the given cluster.'''
    return len(clustering[cluster])/N

def prob_joint_dist(clustering1, clustering2, cluster1, cluster2, N):
    '''
    Joint probability that a random element falls in cluster1 of
    clustering1 and in cluster2 of clustering2.

    N(int): total number of elements
    clustering1(list): first partition
    clustering2(list): second partition
    cluster1(int): index of a cluster of the first partition
    cluster2(int): index of a cluster of the second partition
    '''
    c1 = clustering1[cluster1]
    c2 = clustering2[cluster2]
    n_ij = len(set(c1).intersection(c2))
    return n_ij/N

def mutual_info(clustering1, clustering2, N):
    '''
    Mutual information between two partitions of the same N elements.

    clustering1(list): first partition
    clustering2(list): second partition
    Note: in Python 2, division from __future__ must be imported for this to work.
    '''
    n_clas = len(clustering1)
    n_com = len(clustering2)
    mutual_info = 0
    for i in range(n_clas):
        for j in range(n_com):
            p_i = prob_dist(clustering1, i, N)
            p_j = prob_dist(clustering2, j, N)
            R_ij = prob_joint_dist(clustering1, clustering2, i, j, N)
            if R_ij > 0:
                mutual_info += R_ij*np.log(R_ij / (p_i * p_j))
    return mutual_info

mutual_info(set_partition1, set_partition2, len(S))

which gives the same answer:

0.63651416829481278

Note that we are using the natural logarithm (nats) and not log2 (bits). The code can easily be adapted, though.
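For instance, to express the result in bits rather than nats, divide by ln 2; a quick sketch using the value from this example:

```python
import math

mi_nats = 0.6365141682948128     # mutual information from the example, in nats
mi_bits = mi_nats / math.log(2)  # nats -> bits, since log2(x) = ln(x) / ln(2)
print(mi_bits)                   # about 0.918
```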

Upvotes: 1
