Johnny

Reputation: 301

Cluster labels comparison - label match

I am comparing different clustering methods, for example Agglomerative Clustering against K-means, predicting from a sample, etc.

I am working in Python, mostly with pandas and sklearn.

The issue, of course, is that the cluster numbers assigned to the observations differ between algorithms, so I get something like this:

[image: clustering comparison 1]

[image: expected clustering comparison 2]

I am doing it manually for 8 clusters, but with more clusters it would be a nightmare.

I think the idea is to relabel the results based on how many observations they have in common. At the moment I am only comparing results with the same number of clusters, which should be easier.

Thanks!

Upvotes: 4

Views: 3873

Answers (2)

Elliott de Launay

Reputation: 1168

A contingency matrix worked for my use case, where K=6 and my label was binary:

from sklearn.metrics.cluster import contingency_matrix

contingency_matrix(y_val_tr, clustering.labels_)

Outputs something like:

array([[ 8, 15,  7,  0, 19,  9],
       [ 1,  0, 13, 16,  0,  0]])

Here the first row holds the number of samples with true label 0 that fall into each cluster, and the second row holds the same counts for true label 1. For my use case I went column by column, took whichever row had the max value as the new label for that cluster, and used the relabeled predictions to evaluate KMeans performance.
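A minimal sketch of that column-by-column relabeling, in case it helps. The arrays `y_true` and `clusters` are made-up toy data standing in for `y_val_tr` and `clustering.labels_`, and the variable names are only illustrative. Note that several clusters can map to the same label here, which is fine when K is larger than the number of classes.

```python
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

# Toy data (hypothetical): binary ground truth and six cluster ids.
y_true = np.array([0, 0, 0, 1, 1, 1, 0, 1, 0, 1])
clusters = np.array([0, 0, 1, 2, 2, 3, 4, 3, 5, 3])

# Rows follow the sorted unique true labels, columns the sorted unique cluster ids.
cm = contingency_matrix(y_true, clusters)
classes = np.unique(y_true)
cluster_ids = np.unique(clusters)

# For each column (cluster), pick the row (true label) with the largest count.
best_rows = cm.argmax(axis=0)
mapping = {int(c): int(classes[r]) for c, r in zip(cluster_ids, best_rows)}

# Replace every cluster id by its majority label and score the result.
relabeled = np.array([mapping[int(c)] for c in clusters])
accuracy = float((relabeled == y_true).mean())
print(mapping, accuracy)
```

Ties in a column are resolved by `argmax` in favour of the first row, so a cluster split evenly between labels is assigned the lower label.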

Upvotes: 1

Pallie

Reputation: 1099

Build a contingency matrix from the output of both models. If you want a similarity-type score, use the adjusted Rand index.
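A rough sketch of both ideas, assuming the two models produce the same number of clusters (as in the question). The label arrays are made-up toy data. The one-to-one relabeling step uses `scipy.optimize.linear_sum_assignment` (the Hungarian method) on the contingency matrix; that is one possible way to do the matching, not something the answer above prescribes.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix

# Toy outputs (hypothetical) of two clustering models on the same 8 observations.
labels_a = np.array([0, 0, 1, 1, 2, 2, 2, 0])
labels_b = np.array([2, 2, 0, 0, 1, 1, 0, 2])

# cm[i, j] = number of observations in cluster i of model A and cluster j of model B.
cm = contingency_matrix(labels_a, labels_b)
ids_a = np.unique(labels_a)
ids_b = np.unique(labels_b)

# One-to-one matching of clusters that maximises the total overlap.
row_ind, col_ind = linear_sum_assignment(cm, maximize=True)
b_to_a = {int(ids_b[c]): int(ids_a[r]) for r, c in zip(row_ind, col_ind)}

# Rewrite model B's labels in model A's numbering so they can be compared directly.
labels_b_aligned = np.array([b_to_a[int(l)] for l in labels_b])
n_agree = int((labels_a == labels_b_aligned).sum())

# The adjusted Rand index ignores label numbering, so no relabeling is needed for it.
ari = adjusted_rand_score(labels_a, labels_b)
print(b_to_a, n_agree, ari)
```

If the two models have different numbers of clusters, some labels stay unmatched and the dictionary lookup fails, so that case needs extra handling.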

Upvotes: 1
