Andrew Ng

Reputation: 340

Comparing HDBSCAN labels with soft cluster results

I'm getting the soft clusters from a dataset using HDBSCAN as follows:

import hdbscan
import numpy as np

clusterer = hdbscan.HDBSCAN(min_cluster_size=10, prediction_data=True)
clusterer.fit(data)
soft_clusters = hdbscan.all_points_membership_vectors(clusterer)
closest_clusters = np.argmax(soft_clusters, axis=1)

soft_clusters is a 2D array of the probabilities that each data point belongs to each cluster, so closest_clusters should be an array holding, for each data point, the label of the cluster it most likely belongs to. However, when I compare closest_clusters with clusterer.labels_ (the label that HDBSCAN assigns to each data point), I find that almost none of them match for the data points that have a label; e.g. a data point with label 3 has 4 as its closest cluster.

I'm not sure if I'm misunderstanding how soft clustering works or if something is wrong with the code. Any help is appreciated!

Upvotes: 3

Views: 1996

Answers (1)

Isopycnal Oscillation

Reputation: 3384

The author of HDBSCAN has attempted to fix this problem, but as it stands this seems to be simply how the algorithm works, and there is no way to change it without some major restructuring. Here is his comment:

Digging into this I think the answer (unfortunately?) is that this is "just how it works". The soft clustering considers the distance from exemplars, and the merge height in the tree between the point and each of the clusters. These points that end up "wrong" are points that sit on a split in the tree -- they have the same merge height to their own cluster (perhaps that is a bug, I'll look into it further). That means tree-wise we don't distinguish them, and in terms of pure ambient distance to exemplars they are closer to the "wrong" cluster, and so get misclassified. This is a little weird, but the soft clustering is ultimately a little different than the hard clustering, so corner cases like this can theoretically occur.
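To see how often this happens on your own data, it may help to count the disagreements while leaving out noise points, which have label -1 and no hard cluster to compare against. The following is only a minimal sketch: compare_hard_and_soft is a hypothetical helper (not part of hdbscan), the arrays are made-up stand-ins for clusterer.labels_ and the output of hdbscan.all_points_membership_vectors, and it assumes that column j of the membership vectors corresponds to hard label j.

```python
import numpy as np


def compare_hard_and_soft(labels, soft_clusters):
    """Hypothetical helper: compare hard labels with the soft-cluster argmax.

    labels        -- 1D array like clusterer.labels_ (noise marked as -1)
    soft_clusters -- 2D array like the membership vectors, one row per point
    Returns the indices of non-noise points whose hard label differs from
    their most likely soft cluster, and the fraction of non-noise points
    for which the two agree.
    """
    labels = np.asarray(labels)
    soft_clusters = np.asarray(soft_clusters)

    # Most likely cluster per point (assumes column j matches label j).
    closest = np.argmax(soft_clusters, axis=1)

    # Noise points have no hard cluster, so exclude them from the comparison.
    clustered = labels != -1
    mismatch = clustered & (closest != labels)

    agreement = 1.0 - mismatch.sum() / max(clustered.sum(), 1)
    return np.flatnonzero(mismatch), agreement


# Made-up stand-ins for clusterer.labels_ and the membership vectors.
labels = np.array([0, 0, 1, 1, -1])
soft = np.array([
    [0.8, 0.1],
    [0.3, 0.5],  # hard label 0, but cluster 1 has the larger membership
    [0.2, 0.7],
    [0.1, 0.6],
    [0.2, 0.2],  # noise point, ignored
])

mismatched, agreement = compare_hard_and_soft(labels, soft)
print(mismatched, agreement)
```

With real output you would pass clusterer.labels_ and soft_clusters instead. If only a handful of points disagree, they are likely the tree-split corner cases described in the comment above; if almost all of them disagree, as in the question, it is probably worth checking first whether the column order of the membership vectors really matches the label numbering in your version of the library.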

Upvotes: 3
