Kuka
Kuka

Reputation: 141

Python Clustering 'purity' metric

I'm using a Gaussian Mixture Model (GMM) from sklearn.mixture to perform clustering of my data set.

I could use the function score() to compute the log probability under the model.

However, I am looking for a metric called 'purity' which is defined in this article.

How can I implement it in Python? My current implementation looks like this:

from sklearn.mixture import GMM

# X is a 1000 x 2 array (1000 samples of 2 coordinates).
# It is actually a 2 dimensional PCA projection of data
# extracted from the MNIST dataset, but this random array
# is equivalent as far as the code is concerned.
X = np.random.rand(1000, 2)

clusterer = GMM(3, 'diag')
clusterer.fit(X)
cluster_labels = clusterer.predict(X)

# Now I can count the labels for each cluster..
count0 = list(cluster_labels).count(0)
count1 = list(cluster_labels).count(1)
count2 = list(cluster_labels).count(2)

But I can not loop through each cluster in order to compute the confusion matrix (according this question)

Upvotes: 14

Views: 33858

Answers (4)

Bai Li
Bai Li

Reputation: 1138

The currently top voted answer correctly implements the purity metric, but may not be the most appropriate metric in all cases, because it does not ensure that each predicted cluster label is assigned only once to a true label.

For example, consider a dataset that is very imbalanced, with 99 examples of one label and 1 example of another label. Then any clustering (e.g: having two equal clusters of size 50) will achieve purity of at least 0.99, rendering it a useless metric.

Instead, in cases where the number of clusters is the same as the number of labels, cluster accuracy may be more appropriate. This has the advantage of mirroring classification accuracy in an unsupervised setting. To compute cluster accuracy, we need to use the Hungarian algorithm to find the optimal matching between cluster labels and true labels. The SciPy function linear_sum_assignment does this:

import numpy as np
from sklearn import metrics
from scipy.optimize import linear_sum_assignment

def cluster_accuracy(y_true, y_pred):
    # compute contingency matrix (also called confusion matrix)
    contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)

    # Find optimal one-to-one mapping between cluster labels and true labels
    row_ind, col_ind = linear_sum_assignment(-contingency_matrix)

    # Return cluster accuracy
    return contingency_matrix[row_ind, col_ind].sum() / np.sum(contingency_matrix)

Upvotes: 0

Ugurite
Ugurite

Reputation: 543

David's answer works but here is another way to do it.

import numpy as np
from sklearn import metrics

def purity_score(y_true, y_pred):
    # compute contingency matrix (also called confusion matrix)
    contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)
    # return purity
    return np.sum(np.amax(contingency_matrix, axis=0)) / np.sum(contingency_matrix) 

Also if you need to compute Inverse Purity, all you need to do is replace "axis=0" by "axis=1".

Upvotes: 26

David
David

Reputation: 116

A very late contribution.

You can try to implement it like this, pretty much like in this gist

def purity_score(y_true, y_pred):
    """Purity score
        Args:
            y_true(np.ndarray): n*1 matrix Ground truth labels
            y_pred(np.ndarray): n*1 matrix Predicted clusters

        Returns:
            float: Purity score
    """
    # matrix which will hold the majority-voted labels
    y_voted_labels = np.zeros(y_true.shape)
    # Ordering labels
    ## Labels might be missing e.g with set like 0,2 where 1 is missing
    ## First find the unique labels, then map the labels to an ordered set
    ## 0,2 should become 0,1
    labels = np.unique(y_true)
    ordered_labels = np.arange(labels.shape[0])
    for k in range(labels.shape[0]):
        y_true[y_true==labels[k]] = ordered_labels[k]
    # Update unique labels
    labels = np.unique(y_true)
    # We set the number of bins to be n_classes+2 so that 
    # we count the actual occurence of classes between two consecutive bins
    # the bigger being excluded [bin_i, bin_i+1[
    bins = np.concatenate((labels, [np.max(labels)+1]), axis=0)

    for cluster in np.unique(y_pred):
        hist, _ = np.histogram(y_true[y_pred==cluster], bins=bins)
        # Find the most present label in the cluster
        winner = np.argmax(hist)
        y_voted_labels[y_pred==cluster] = winner

    return accuracy_score(y_true, y_voted_labels)

Upvotes: 4

kdbanman
kdbanman

Reputation: 10587

sklearn doesn't implement a cluster purity metric. You have 2 options:

  1. Implement the measurement using sklearn data structures yourself. This and this have some python source for measuring purity, but either your data or the function bodies need to be adapted for compatibility with each other.

  2. Use the (much less mature) PML library, which does implement cluster purity.

Upvotes: 5

Related Questions