Comparing HDBSCAN labels with soft cluster results

Question

I'm getting the soft clusters from a dataset using HDBSCAN as follows:

import hdbscan
import numpy as np

# prediction_data=True is required for the soft-clustering functions
clusterer = hdbscan.HDBSCAN(min_cluster_size=10, prediction_data=True)
clusterer.fit(data)
soft_clusters = hdbscan.all_points_membership_vectors(clusterer)
closest_clusters = np.argmax(soft_clusters, axis=1)

soft_clusters is a 2D array holding, for each data point, the probability that it belongs to each cluster, so closest_clusters should contain the cluster that each data point is most likely to belong to. However, when I compare closest_clusters with clusterer.labels_ (the label that HDBSCAN assigns to each data point), almost none of the values match for the data points that have a label. For example, a data point with label 3 has 4 as its closest cluster.
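The comparison described above can be sketched as a small helper. This is only a minimal sketch: `label_agreement` is a hypothetical name, the toy arrays are made up for illustration rather than taken from real HDBSCAN output, and it assumes that column j of the membership array corresponds to cluster label j and that noise points carry the label -1.

```python
import numpy as np

def label_agreement(labels, soft_clusters):
    """Return (fraction of labelled points whose hard label equals the
    argmax of their membership vector, array of argmax clusters).

    Hypothetical helper; assumes column j of soft_clusters corresponds
    to cluster label j and that noise points are labelled -1.
    """
    labels = np.asarray(labels)
    soft_clusters = np.asarray(soft_clusters)
    closest = np.argmax(soft_clusters, axis=1)
    clustered = labels != -1  # ignore noise points
    if not clustered.any():
        return float("nan"), closest
    rate = float(np.mean(closest[clustered] == labels[clustered]))
    return rate, closest

# Toy values for illustration only (not real HDBSCAN output)
labels = np.array([0, 0, 1, -1, 1])
soft = np.array([
    [0.80, 0.10],
    [0.60, 0.30],
    [0.20, 0.70],
    [0.30, 0.30],
    [0.55, 0.40],
])
rate, closest = label_agreement(labels, soft)
print(rate, closest)
```

With the fitted model from the question, the call would be `label_agreement(clusterer.labels_, soft_clusters)`. A rate close to zero on the labelled points is the mismatch being asked about.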

I'm not sure if I'm misunderstanding how soft clustering works or if something is wrong with the code. Any help is appreciated!
