user1571823

Reputation: 394

HDBSCAN Python choose number of clusters

Is it possible to select the number of clusters in the HDBSCAN algorithm in Python? Or is the only way to play around with input parameters such as alpha and min_cluster_size?

Thanks

UPDATE: here is the code to use fcluster and hdbscan

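import hdbscan
from scipy.cluster.hierarchy import fcluster

# X is the (n_samples, n_features) data array to cluster
clusterer = hdbscan.HDBSCAN()
clusterer.fit(X)
# Export the single-linkage tree as a SciPy-style linkage matrix
Z = clusterer.single_linkage_tree_.to_numpy()
# Cut the tree so that at most 2 flat clusters remain
labels = fcluster(Z, 2, criterion='maxclust')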

Upvotes: 7

Views: 8541

Answers (2)

Annie

Reputation: 305

Thankfully, in June 2020 a contributor on GitHub provided a commit ("Module for flat clustering") that adds code to hdbscan allowing us to choose the number of resulting clusters.

To do so:

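from hdbscan import flat

# train_df: training data; n_clusters: desired number of clusters
clusterer = flat.HDBSCAN_flat(train_df, n_clusters, prediction_data=True)
# Assign new points to the flat clusters found above
flat.approximate_predict_flat(clusterer, points_to_predict, n_clusters)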

You can find the code in flat.py. You should be able to choose the number of clusters for test points using approximate_predict_flat.

In addition, a Jupyter notebook explaining how to use it has also been written.

Upvotes: 6

Leland McInnes

Reputation: 346

If you explicitly need to get a fixed number of clusters then the closest thing to managing that would be to use the cluster hierarchy and perform a flat cut through the hierarchy at the level that gives you the desired number of clusters. That does involve working with one of the tree objects that HDBSCAN exposes and getting your hands a little dirty, but it can be done.
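
To make the idea of a flat cut concrete, here is a rough sketch that uses only NumPy and SciPy. It is not the hdbscan library's implementation: it rebuilds a simplified version of the hierarchy (single linkage over mutual reachability distances) and then cuts it with fcluster. The function name flat_cut, the min_samples default, and the toy data are all assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform


def flat_cut(X, n_clusters, min_samples=5):
    """Cut a single-linkage tree built on mutual reachability distances.

    Hypothetical helper for illustration; a simplified stand-in for the
    hierarchy HDBSCAN builds, not the library's own code.
    """
    dist = squareform(pdist(X))  # dense pairwise Euclidean distances
    # Core distance: distance to the min_samples-th nearest neighbour
    # (index 0 of each sorted row is the point itself, at distance 0).
    core = np.sort(dist, axis=1)[:, min_samples]
    # Mutual reachability distance: max(core(a), core(b), d(a, b)).
    mreach = np.maximum(dist, np.maximum.outer(core, core))
    np.fill_diagonal(mreach, 0.0)
    # Single linkage on the condensed mutual reachability matrix.
    Z = linkage(squareform(mreach, checks=False), method="single")
    # Cut at the level that leaves at most n_clusters flat clusters.
    return fcluster(Z, n_clusters, criterion="maxclust")


# Toy data: three well-separated blobs of 50 points each.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.3, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.3, size=(50, 2)),
])
labels = flat_cut(X, n_clusters=3)
```

Note that a plain cut like this assigns every point to a cluster, so it produces no noise label, and on messier data it can return very small clusters because nothing here enforces min_cluster_size.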

Upvotes: 2
