emax

Reputation: 7245

Python: how to compare the similarity between clustering using k-means algorithm?

I have two observations of the same event, say X and Y. I assume there are nc clusters. I am using sklearn to do the clustering.

from sklearn.cluster import KMeans
x = KMeans(n_clusters=nc).fit_predict(X)
y = KMeans(n_clusters=nc).fit_predict(Y)

Is there a measure that allows me to compare x and y, i.e. a measure that equals 1 if the clusterings x and y are the same?

Upvotes: 0

Views: 2948

Answers (2)

Sam A.

Reputation: 453

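The Rand Index and its adjusted version do exactly this. Two cluster assignments that match get a score of 1, even if the labels themselves, which are treated as arbitrary, are different. For the plain Rand Index, a value of 0 means the two assignments disagree on every pair of points. The Adjusted Rand Index corrects for chance: its baseline is a random assignment of points to clusters, so it is close to 0 for random labelings and can be negative. Both measures require x and y to label the same points in the same order.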
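A minimal sketch of this using sklearn.metrics.adjusted_rand_score. The label arrays below are made-up stand-ins for the x and y from the question, and the names y_same and y_diff are only for illustration:

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical label arrays standing in for x and y from the question.
# Both must label the same points, in the same order.
x = [0, 0, 1, 1, 2, 2]
y_same = [2, 2, 0, 0, 1, 1]  # same partition as x, only the label names differ
y_diff = [0, 1, 0, 1, 0, 1]  # a different partition of the same points

score_same = adjusted_rand_score(x, y_same)  # identical partitions score 1
score_diff = adjusted_rand_score(x, y_diff)  # disagreeing partitions score lower

print(score_same)
print(score_diff)
```

The score is symmetric, so the order of the two arguments does not matter.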

Upvotes: 1

sascha

Reputation: 33522

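Just extract the cluster centers of your fitted kmeans-objects (see the docs). Note that fit_predict returns only the label array, so keep the fitted estimators around:

km_x = KMeans(n_clusters=nc).fit(X)
km_y = KMeans(n_clusters=nc).fit(Y)
x_centers = km_x.cluster_centers_
y_centers = km_y.cluster_centers_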

Then you have to decide which metric to use to compare them. Keep in mind that the centers are floating-point values, that k-means is a heuristic, and that it is a randomized algorithm (the result depends on the initialization). This means that, with high probability, the two sets of centers will not be exactly the same, even for models trained on the same data.

This link discusses some approaches and the problems.
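One possible way to compare the two sets of centers, sketched under some assumptions: the order of the centers is arbitrary, so they are first paired one-to-one with scipy.optimize.linear_sum_assignment, and the mean Euclidean distance between paired centers is reported. The helper name matched_center_distance, the choice of metric, and the example arrays are all hypothetical, not something the answer prescribes:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist


def matched_center_distance(x_centers, y_centers):
    """Mean Euclidean distance between optimally paired centers (hypothetical helper)."""
    cost = cdist(x_centers, y_centers)        # pairwise distances, shape (nc, nc)
    rows, cols = linear_sum_assignment(cost)  # one-to-one pairing with minimal total distance
    return cost[rows, cols].mean()


# Made-up centers standing in for km.cluster_centers_ of two fitted models.
x_centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
y_centers = np.array([[10.0, 0.1], [0.1, 0.0], [5.0, 5.0]])  # similar centers, different order

d = matched_center_distance(x_centers, y_centers)
print(d)
```

A value near 0 means the two models found nearly the same centers. How small counts as "the same" depends on the scale of the data, so a tolerance has to be chosen for the problem at hand.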

Upvotes: 2
