Reputation: 301
I am comparing different clustering methods, for example agglomerative clustering against K-means, predicting from a sample, etc.
I am working in Python, mostly using pandas and sklearn.
The issue I have, of course, is that the cluster numbers assigned to the observations differ from one algorithm to the next, and I get something similar to this:
I am matching the labels manually for 8 clusters, but with more clusters it would be a nightmare.
I think the idea is to relabel the results based on how many observations the clusters have in common. At the moment I am only comparing results with the same number of clusters, which should be easier.
Thanks!
Upvotes: 4
Views: 3873
Reputation: 1168
A contingency matrix worked for my use case, where K=6 and my label was binary:
from sklearn.metrics.cluster import contingency_matrix
contingency_matrix(y_val_tr, clustering.labels_)
Outputs something like:
array([[ 8, 15, 7, 0, 19, 9],
[ 1, 0, 13, 16, 0, 0]])
Here the first row counts, for each of the 6 clusters, the samples whose true label is 0, and the second row counts the samples whose true label is 1. For my use case I went column by column and took whichever row had the max value as that cluster's label, then used the relabeled predictions to evaluate KMeans performance.
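A minimal sketch of that column-by-column relabeling, assuming NumPy is available and that ties may be broken by taking the first class. The helper name `majority_relabel` is hypothetical and is not part of sklearn:

```python
import numpy as np
from sklearn.metrics.cluster import contingency_matrix


def majority_relabel(y_true, cluster_labels):
    """Replace each cluster id with the true class most common in that cluster.

    Hypothetical helper for illustration only.
    """
    y_true = np.asarray(y_true)
    cluster_labels = np.asarray(cluster_labels)

    # contingency_matrix orders rows by sorted unique true labels and
    # columns by sorted unique cluster ids, so np.unique gives matching indices.
    classes = np.unique(y_true)
    _, cluster_idx = np.unique(cluster_labels, return_inverse=True)

    cm = contingency_matrix(y_true, cluster_labels)  # shape: (n_classes, n_clusters)

    # For every column (cluster), pick the row (class) with the largest count.
    best_class_per_cluster = classes[cm.argmax(axis=0)]

    # Translate each sample's cluster id into the chosen class.
    return best_class_per_cluster[cluster_idx]


# Toy data: three clusters, two true classes.
y_true = [0, 0, 0, 1, 1, 1]
clusters = [2, 2, 1, 0, 0, 0]
y_pred = majority_relabel(y_true, clusters)
accuracy = float(np.mean(y_pred == np.asarray(y_true)))
```

Note that several clusters can map to the same class here, which is fine when K is larger than the number of true labels but does not give a one-to-one matching.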
Upvotes: 1
Reputation: 1099
Build a contingency matrix from the outputs of both models. If you want a similarity-type score, use the adjusted Rand index, which does not depend on how the clusters are numbered.
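One way this could look in practice, as a sketch rather than a definitive recipe: it assumes both models produce the same number of clusters, and it adds a one-to-one matching step using SciPy's `linear_sum_assignment` (the Hungarian method), which this answer does not itself mention. The helper name `align_labels` is hypothetical:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix


def align_labels(reference, other):
    """Renumber `other` so its cluster ids agree with `reference` where possible.

    Hypothetical helper; assumes both labelings have the same number of clusters.
    """
    reference = np.asarray(reference)
    other = np.asarray(other)

    ref_ids = np.unique(reference)    # row order of the contingency matrix
    other_ids = np.unique(other)      # column order of the contingency matrix

    # cm[i, j] = number of samples in reference cluster i and other cluster j.
    cm = contingency_matrix(reference, other)

    # One-to-one pairing of clusters that maximizes the total overlap.
    rows, cols = linear_sum_assignment(cm, maximize=True)
    mapping = dict(zip(other_ids[cols], ref_ids[rows]))

    # Any id without a partner keeps its original value.
    return np.array([mapping.get(label, label) for label in other])


# Toy data: identical partitions with permuted cluster numbers.
labels_a = [0, 0, 1, 1, 2, 2]
labels_b = [2, 2, 0, 0, 1, 1]

aligned_b = align_labels(labels_a, labels_b)

# The adjusted Rand index ignores numbering, so it needs no relabeling.
ari = adjusted_rand_score(labels_a, labels_b)
```

After alignment the two label columns can be compared directly, for example with a pandas crosstab, while the adjusted Rand index gives a single similarity score.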
Upvotes: 1