Cluster labels comparison - label match

I am comparing different clustering methods. For example Agglomerative Clustering with K-means, predicting from a sample, etc.

I am working in Python, mostly using pandas and sklearn.

The issue I have, of course, is that the cluster numbers assigned to the observations differ from one algorithm to the next, and I get something similar to this:

I am matching them manually for 8 clusters, but with more clusters it would be a nightmare.

I think the idea is to relabel the results based on how many observations the clusters have in common. At the moment I am only comparing results with the same number of clusters, which should be easier.

Thanks!

Upvotes: 4

Answers (2)

Elliott de Launay

Reputation: 1168

A contingency matrix worked for my use case, where K=6 and my label was binary:

from sklearn.metrics.cluster import contingency_matrix

contingency_matrix(y_val_tr, clustering.labels_)

Outputs something like:

array([[ 8, 15,  7,  0, 19,  9],
       [ 1,  0, 13, 16,  0,  0]])

Here the rows correspond to the labels passed as the first argument and the columns to the six clusters: the first row counts, per cluster, the samples whose label is 0, and the second row counts the samples whose label is 1. For my use case I went column by column, took whichever row had the max value as that cluster's label, and used the relabeled clusters to evaluate KMeans performance.
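The column-by-column step can be sketched roughly as follows. This is only an illustration, not the exact code used for the answer: the helper name majority_relabel and the toy inputs are assumptions, and it relies on contingency_matrix ordering its rows and columns by the sorted unique labels.

```python
import numpy as np
from sklearn.metrics.cluster import contingency_matrix


def majority_relabel(y_true, cluster_labels):
    """Hypothetical helper: map each cluster id to its most frequent true label."""
    y_true = np.asarray(y_true)
    cluster_labels = np.asarray(cluster_labels)
    # Rows follow the sorted unique true labels, columns the sorted unique cluster ids.
    classes = np.unique(y_true)
    clusters = np.unique(cluster_labels)
    cm = contingency_matrix(y_true, cluster_labels)
    # For every column (cluster), pick the row (true label) with the largest count.
    best = classes[cm.argmax(axis=0)]
    lookup = dict(zip(clusters, best))
    return np.array([lookup[c] for c in cluster_labels])


# The matrix from the output above, used directly: one majority label per cluster.
cm_example = np.array([[8, 15, 7, 0, 19, 9],
                       [1, 0, 13, 16, 0, 0]])
cluster_to_label = cm_example.argmax(axis=0)
```

Note that several clusters can end up with the same label under this rule, which is fine for a binary target but does not give a one-to-one match between two clusterings.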

Upvotes: 1

Pallie

Reputation: 1099

Build a contingency matrix from the output of both models. If you want a similarity-type score, use the adjusted Rand index.
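As a rough sketch of how that could look when both models produce the same number of clusters: the contingency matrix gives the overlap between every pair of clusters, and one way to turn it into a one-to-one relabeling is the Hungarian algorithm via scipy's linear_sum_assignment. That matching step is a swapped-in technique, not something stated in this answer, and the helper name match_labels and the toy labels are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix


def match_labels(labels_a, labels_b):
    """Hypothetical helper: rename the clusters in labels_b to the ids used in labels_a.

    Assumes both labelings contain the same number of distinct clusters.
    """
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    ids_a = np.unique(labels_a)
    ids_b = np.unique(labels_b)
    # cm[i, j] = number of observations in cluster ids_a[i] and cluster ids_b[j].
    cm = contingency_matrix(labels_a, labels_b)
    # Negate the counts so that minimizing the cost maximizes the total overlap.
    rows, cols = linear_sum_assignment(-cm)
    lookup = {ids_b[c]: ids_a[r] for r, c in zip(rows, cols)}
    return np.array([lookup[b] for b in labels_b])


# Toy example: the same partition under two different numberings.
labels_a = [0, 0, 1, 1, 2, 2]
labels_b = [2, 2, 0, 0, 1, 1]
labels_b_matched = match_labels(labels_a, labels_b)

# The adjusted Rand index ignores label names, so no relabeling is needed for it.
ari = adjusted_rand_score(labels_a, labels_b)
```

The relabeled output makes side-by-side comparison in a DataFrame easier, while the adjusted Rand index summarizes the agreement in a single number.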

Upvotes: 1
