Eduardo

Reputation: 1275

Why doesn't sklearn.cluster.AgglomerativeClustering give us the distances between the merged clusters?

I'm using sklearn.cluster.AgglomerativeClustering. It begins with one cluster per data point and iteratively merges together the two "closest" clusters, thus forming a binary tree. What constitutes distance between clusters depends on a linkage parameter.

It would be useful to know the distance between the merged clusters at each step. We could then stop merging once the clusters being merged get too far apart. Alas, that distance does not seem to be available in AgglomerativeClustering.

Am I missing something? Is there a way to recover the distances?

Upvotes: 4

Views: 10401

Answers (2)

erobertc

Reputation: 644

When this question was originally asked, and when the other answer was posted, sklearn did not expose the distances. It now does, however, as demonstrated in this example and this answer to a similar question.
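In recent versions of scikit-learn (0.22 and later), fitting the full tree exposes the merge distances via the `distances_` attribute. A minimal sketch (the blob data here is just for illustration):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=20, n_features=2, centers=3,
                  cluster_std=0.5, random_state=0)

# distance_threshold=0 with n_clusters=None builds the full tree
# and makes the merge distances available as model.distances_
model = AgglomerativeClustering(distance_threshold=0, n_clusters=None,
                                linkage='ward').fit(X)

# one distance per merge: n_samples - 1 entries
print(model.distances_.shape)  # (19,)
```

`distances_[i]` is the distance between the two clusters merged at step `i`, matching `children_[i]`, so you can stop wherever the distances jump.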

Upvotes: 1

σηγ

Reputation: 1324

You might want to take a look at scipy.cluster.hierarchy which offers somewhat more options than sklearn.cluster.AgglomerativeClustering.

The clustering is done with the linkage function which returns a matrix containing the distances between the merged clusters. These can be visualised with a dendrogram:

from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

X, cl = make_blobs(n_samples=20, n_features=2, centers=3, cluster_std=0.5, random_state=0)
Z = linkage(X, method='ward')

plt.figure()
dendrogram(Z)
plt.show()

[dendrogram.png: dendrogram of the hierarchical clustering]

One can form flat clusters from the linkage matrix based on various criteria, e.g. the distance of observations:

clusters = fcluster(Z, 5, criterion='distance')
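The distances the question asks about sit directly in the third column of the linkage matrix: each row of Z records one merge. A short sketch of reading them and deriving a cluster count from a distance cutoff (repeating the setup above so it runs on its own; the threshold of 5 is just an example):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=20, n_features=2, centers=3,
                  cluster_std=0.5, random_state=0)
Z = linkage(X, method='ward')

# each row of Z is: (cluster index, cluster index, merge distance, new cluster size)
merge_distances = Z[:, 2]

# 20 points means 19 merges; with Ward linkage the distances are non-decreasing
print(merge_distances.shape)  # (19,)

# number of flat clusters left if we refuse merges above a threshold of 5,
# which matches fcluster(Z, 5, criterion='distance')
n_clusters = 1 + int(np.sum(merge_distances > 5))
```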

Scipy's hierarchical clustering is discussed in much more detail here.

Upvotes: 7
