PEREZje

Reputation: 2502

Calculate cluster accuracy of two clustering outcomes

So say I have two clustering outcomes that look like this:

clustering = [[8, 9, 10, 11], [14, 13, 4, 7, 6, 12, 5, 15], [1, 2, 0, 3]]
correct_clustering = [[2, 8, 10, 0, 15], [12, 13, 9, 14], [11, 3, 5, 1, 4, 6, 7]]

How would I go about comparing the outcome contained in clustering to the one contained in correct_clustering? I want some number between 0 and 1. I was thinking of calculating the fraction of pairs which are correctly clustered together in the same cluster, but I can't think of a programmatic way to do this.
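For what it's worth, the "fraction of agreeing pairs" idea from the question can be coded directly with plain Python. A minimal sketch (labels_of and pair_agreement are helper names introduced here for illustration); this agreement fraction is exactly the (unadjusted) Rand index:

```python
from itertools import combinations

clustering = [[8, 9, 10, 11], [14, 13, 4, 7, 6, 12, 5, 15], [1, 2, 0, 3]]
correct_clustering = [[2, 8, 10, 0, 15], [12, 13, 9, 14], [11, 3, 5, 1, 4, 6, 7]]

def labels_of(clusters):
    # map each element to the index of the cluster that contains it
    return {x: i for i, cluster in enumerate(clusters) for x in cluster}

def pair_agreement(a, b):
    # fraction of element pairs on which both clusterings agree
    # ("same cluster" vs "different cluster") -- the Rand index
    la, lb = labels_of(a), labels_of(b)
    agree = total = 0
    for x, y in combinations(sorted(la), 2):
        total += 1
        agree += (la[x] == la[y]) == (lb[x] == lb[y])
    return agree / total

print(pair_agreement(clustering, correct_clustering))
# 0.5583333333333333  (67 of 120 pairs agree)
```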

Upvotes: 2

Views: 1008

Answers (3)

Mykola Zotko

Reputation: 17902

You can use the function adjusted_rand_score in sklearn:

from sklearn.metrics import adjusted_rand_score

# convert the list of clusters into a flat list of labels, ordered by element
clustering = sorted((i, num) for num, lst in enumerate(clustering) for i in lst)
clustering = [i for _, i in clustering]
# [2, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1]

correct_clustering = sorted((i, num) for num, lst in enumerate(correct_clustering) for i in lst)
correct_clustering = [i for _, i in correct_clustering]
# [0, 2, 0, 2, 2, 2, 2, 2, 0, 1, 0, 2, 1, 1, 1, 0]

ari = adjusted_rand_score(correct_clustering, clustering)
# -0.012738853503184737

The function returns values between -1 and 1, so to get a value between 0 and 1 you can rescale:

ari_scaled = (ari + 1) / 2
# 0.49363057324840764
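Alternatively, if you want a value in [0, 1] without rescaling, newer scikit-learn versions (0.24+) also provide the unadjusted rand_score. A sketch, reusing the label lists built above:

```python
from sklearn.metrics import rand_score

clustering_labels = [2, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
correct_labels = [0, 2, 0, 2, 2, 2, 2, 2, 0, 1, 0, 2, 1, 1, 1, 0]

# fraction of element pairs on which both clusterings agree, already in [0, 1]
ri = rand_score(correct_labels, clustering_labels)
print(ri)
# 0.5583333333333333
```

Note that unlike the ARI, this value is not adjusted for chance, so a random labeling does not score near 0.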

Upvotes: -1

seralouk

Reputation: 33147

Use the Rand Index:

import numpy as np
from scipy.special import comb

def rand_index_score(clusters, classes):

    tp_plus_fp = comb(np.bincount(clusters), 2).sum()
    tp_plus_fn = comb(np.bincount(classes), 2).sum()
    A = np.c_[(clusters, classes)]
    tp = sum(comb(np.bincount(A[A[:, 0] == i, 1]), 2).sum()
             for i in set(clusters))
    fp = tp_plus_fp - tp
    fn = tp_plus_fn - tp
    tn = comb(len(A), 2) - tp - fp - fn
    return (tp + tn) / (tp + fp + fn + tn)

clusters = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]

classes = [0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 2, 1, 0, 2, 2, 2, 0]

rand_index_score(clusters, classes)
0.6764705882352942
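Applied to the lists from the question, they first need to be flattened into label vectors. A sketch (to_labels is a helper added here for illustration; rand_index_score is repeated so the snippet runs on its own):

```python
import numpy as np
from scipy.special import comb

def rand_index_score(clusters, classes):
    tp_plus_fp = comb(np.bincount(clusters), 2).sum()
    tp_plus_fn = comb(np.bincount(classes), 2).sum()
    A = np.c_[(clusters, classes)]
    tp = sum(comb(np.bincount(A[A[:, 0] == i, 1]), 2).sum()
             for i in set(clusters))
    fp = tp_plus_fp - tp
    fn = tp_plus_fn - tp
    tn = comb(len(A), 2) - tp - fp - fn
    return (tp + tn) / (tp + fp + fn + tn)

def to_labels(clusters):
    # map each element to its cluster index, ordered by element
    mapping = {x: i for i, cluster in enumerate(clusters) for x in cluster}
    return [mapping[x] for x in sorted(mapping)]

clustering = [[8, 9, 10, 11], [14, 13, 4, 7, 6, 12, 5, 15], [1, 2, 0, 3]]
correct_clustering = [[2, 8, 10, 0, 15], [12, 13, 9, 14], [11, 3, 5, 1, 4, 6, 7]]

print(rand_index_score(to_labels(clustering), to_labels(correct_clustering)))
# 0.5583333333333333
```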


Upvotes: 0

Has QUIT--Anony-Mousse

Reputation: 77485

The best practice measures are indeed based on pair counting.

In particular the adjusted Rand index (ARI) is the standard measure here.

You don't actually enumerate the pairs: the number of pairs from a set of n objects can trivially be computed with the binomial coefficient C(n, 2) = n*(n-1)/2, i.e. (n*(n-1))>>1.

You'll need this for each cluster and each cluster intersection.

The results of all intersections are aggregated, and it is easy to see that this aggregate is invariant to permutations of the clusters (and hence to the cluster labels).

The Rand index is the accuracy of predicting whether two objects a and b are in the same cluster or in different clusters. The ARI improves on this by adjusting for chance: on a very unbalanced problem, a random result can score a high accuracy, but its ARI will be close to 0 on average.
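The pair counting described above can be sketched in a few lines of plain Python (n_pairs and adjusted_rand are names chosen here for illustration; the per-cell counts come from the contingency of the two labelings):

```python
from collections import Counter

def n_pairs(n):
    # number of unordered pairs from a set of n objects: C(n, 2) = n*(n-1)/2
    return n * (n - 1) // 2

def adjusted_rand(labels_a, labels_b):
    # pair counts per cluster, per class, and per intersection cell
    sum_a = sum(n_pairs(c) for c in Counter(labels_a).values())
    sum_b = sum(n_pairs(c) for c in Counter(labels_b).values())
    sum_ab = sum(n_pairs(c) for c in Counter(zip(labels_a, labels_b)).values())
    total = n_pairs(len(labels_a))
    expected = sum_a * sum_b / total    # expected agreement under chance
    max_index = (sum_a + sum_b) / 2
    return (sum_ab - expected) / (max_index - expected)

# label vectors for the question's data
a = [2, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
b = [0, 2, 0, 2, 2, 2, 2, 2, 0, 1, 0, 2, 1, 1, 1, 0]
print(adjusted_rand(a, b))
# -0.012738853503184737  (matches sklearn's adjusted_rand_score)
```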

Upvotes: 2
