Reputation: 2502
So say I have two clustering outcomes that look like this:
clustering = [[8, 9, 10, 11], [14, 13, 4, 7, 6, 12, 5, 15], [1, 2, 0, 3]]
correct_clustering = [[2, 8, 10, 0, 15], [12, 13, 9, 14], [11, 3, 5, 1, 4, 6, 7]]
How would I go about comparing the outcome contained in clustering
to the one contained in correct_clustering
? I want some number between 0 and 1. I was thinking about calculating the fraction of pairs which are correctly clustered together in the same cluster, but I can't think of a programmatic way to solve this.
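For reference, the pair-based idea sketched above can be written directly (a minimal sketch using the two lists from the question; taking the Jaccard similarity of the two pair sets is just one way to turn the pair counts into a 0–1 score):

```python
from itertools import combinations

clustering = [[8, 9, 10, 11], [14, 13, 4, 7, 6, 12, 5, 15], [1, 2, 0, 3]]
correct_clustering = [[2, 8, 10, 0, 15], [12, 13, 9, 14], [11, 3, 5, 1, 4, 6, 7]]

def together_pairs(clustering):
    # all unordered pairs of elements that share a cluster
    return {frozenset(p) for cluster in clustering for p in combinations(cluster, 2)}

pred = together_pairs(clustering)
true = together_pairs(correct_clustering)

# Jaccard similarity of the two pair sets: 1.0 for identical clusterings
score = len(pred & true) / len(pred | true)
# 12/65 ≈ 0.1846
```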
Upvotes: 2
Views: 1008
Reputation: 17902
You can use the function adjusted_rand_score in sklearn:
from sklearn.metrics import adjusted_rand_score

# Convert each list-of-clusters into a flat label list, ordered by element index
clustering = sorted((i, num) for num, lst in enumerate(clustering) for i in lst)
clustering = [i for _, i in clustering]
# [2, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1]

correct_clustering = sorted((i, num) for num, lst in enumerate(correct_clustering) for i in lst)
correct_clustering = [i for _, i in correct_clustering]
# [0, 2, 0, 2, 2, 2, 2, 2, 0, 1, 0, 2, 1, 1, 1, 0]

ari = adjusted_rand_score(correct_clustering, clustering)
# -0.012738853503184737
The function returns values between -1 and 1, so to get a value between 0 and 1 you need to rescale:
ari_scaled = (ari + 1) / 2
# 0.49363057324840764
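As an aside, the sort-based conversion above can also be written with a dict (a sketch, assuming as above that every element appears in exactly one cluster); it produces the same flat label list:

```python
clustering = [[8, 9, 10, 11], [14, 13, 4, 7, 6, 12, 5, 15], [1, 2, 0, 3]]

# map each element to the index of its cluster, then read labels out in element order
label_of = {i: num for num, lst in enumerate(clustering) for i in lst}
labels = [label_of[i] for i in sorted(label_of)]
# [2, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
```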
Upvotes: -1
Reputation: 33147
Use the Rand Index:
import numpy as np
from scipy.special import comb
def rand_index_score(clusters, classes):
    tp_plus_fp = comb(np.bincount(clusters), 2).sum()
    tp_plus_fn = comb(np.bincount(classes), 2).sum()
    A = np.c_[(clusters, classes)]
    tp = sum(comb(np.bincount(A[A[:, 0] == i, 1]), 2).sum()
             for i in set(clusters))
    fp = tp_plus_fp - tp
    fn = tp_plus_fn - tp
    tn = comb(len(A), 2) - tp - fp - fn
    return (tp + tn) / (tp + fp + fn + tn)
clusters = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
classes = [0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 2, 1, 0, 2, 2, 2, 0]
rand_index_score(clusters, classes)
0.6764705882352942
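As a sanity check, the vectorized function can be compared against a naive O(n²) pair count (a sketch that just restates the definition of the Rand index: the fraction of pairs on which the two labelings agree):

```python
from itertools import combinations

def rand_index_naive(clusters, classes):
    # a pair agrees if both labelings put it in the same cluster,
    # or both labelings put it in different clusters
    pairs = list(combinations(range(len(clusters)), 2))
    agree = sum((clusters[i] == clusters[j]) == (classes[i] == classes[j])
                for i, j in pairs)
    return agree / len(pairs)

clusters = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
classes = [0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 2, 1, 0, 2, 2, 2, 0]
rand_index_naive(clusters, classes)
# 0.6764705882352942
```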
Upvotes: 0
Reputation: 77485
The best practice measures are indeed based on pair counting.
In particular the adjusted Rand index (ARI) is the standard measure here.
You don't actually enumerate the pairs: the number of pairs in a set of n elements can trivially be computed in closed form with the binomial coefficient, simply n*(n-1)/2 (as a bit shift, (n*(n-1))>>1).
You'll need this for each cluster and each cluster intersection.
The results of all intersections are aggregated, and it is easy to see that this is invariant to permutations of the clusters (and hence to the cluster labels).

The Rand index is the accuracy of predicting whether two objects a, b are in the same cluster or in different clusters. The ARI improves on this by adjusting for chance: on a very unbalanced problem, a random result can score a high accuracy, but its ARI is close to 0 on average.
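To make the pair counting concrete, here is a from-scratch ARI built only from binomial coefficients over the cluster sizes and the sizes of the cluster intersections (a sketch following the standard adjusted-for-chance formula; not sklearn's implementation, but it agrees with adjusted_rand_score up to float rounding):

```python
from collections import Counter
from math import comb

def ari(labels_true, labels_pred):
    n = len(labels_true)
    # pairs together in both clusterings (cluster intersections), and in each separately
    sum_ab = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_a * sum_b / comb(n, 2)   # chance level
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:               # degenerate case, e.g. one big cluster
        return 1.0
    return (sum_ab - expected) / (max_index - expected)

ari([0, 2, 0, 2, 2, 2, 2, 2, 0, 1, 0, 2, 1, 1, 1, 0],
    [2, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1])
# ≈ -0.0127, matching the adjusted_rand_score value in the answer above
```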
Upvotes: 2