Reputation: 394
Is it possible to select the number of clusters in the HDBSCAN algorithm in Python? Or is the only way to play around with input parameters such as alpha and min_cluster_size?
Thanks
UPDATE: here is the code to use fcluster and hdbscan
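import hdbscan
from scipy.cluster.hierarchy import fcluster
# X is the data to cluster, an array of shape (n_samples, n_features)
clusterer = hdbscan.HDBSCAN()
clusterer.fit(X)
# Export the single-linkage tree as a SciPy-style linkage matrix
Z = clusterer.single_linkage_tree_.to_numpy()
# Cut the tree so that at most 2 flat clusters remain
labels = fcluster(Z, 2, criterion='maxclust')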
Upvotes: 7
Views: 8541
Reputation: 305
Thankfully, in June 2020 a contributor on GitHub (Module for flat clustering) provided a commit that adds code to hdbscan allowing us to choose the number of resulting clusters.
To do so:
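from hdbscan import flat
# Fit HDBSCAN on the training data and select a flat clustering with n_clusters clusters
clusterer = flat.HDBSCAN_flat(train_df, n_clusters, prediction_data=True)
# Assign new points to those clusters
labels, probabilities = flat.approximate_predict_flat(clusterer, points_to_predict, n_clusters)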
You can find the code here: flat.py. You should be able to choose the number of clusters for test points using approximate_predict_flat.
In addition, a Jupyter notebook explaining how to use it is available here.
Upvotes: 6
Reputation: 346
If you explicitly need a fixed number of clusters, then the closest thing to managing that is to use the cluster hierarchy and perform a flat cut through it at the level that gives you the desired number of clusters. That does involve working with one of the tree objects that HDBSCAN exposes and getting your hands a little dirty, but it can be done.
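To make the "flat cut" idea concrete, here is a minimal sketch in plain Python of what cutting a hierarchy into a fixed number of clusters amounts to. It assumes a SciPy-style linkage matrix, which is the format that `clusterer.single_linkage_tree_.to_numpy()` returns. The function name `cut_linkage` and the toy matrix are made up for illustration and are not part of hdbscan.

```python
def cut_linkage(Z, n_clusters):
    """Cut a SciPy-style linkage matrix into n_clusters flat clusters.

    Z has one row per merge: (left_id, right_id, distance, size), sorted by
    increasing distance. Ids below n are original points; id n + i is the
    cluster created by row i. (Hypothetical helper, for illustration only.)
    """
    n = len(Z) + 1                      # number of original points
    if not 1 <= n_clusters <= n:
        raise ValueError("n_clusters must be between 1 and the number of points")

    parent = list(range(2 * n - 1))     # union-find over points and merged nodes

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    # Applying only the first n - n_clusters merges leaves n_clusters groups.
    for i in range(n - n_clusters):
        left, right = int(Z[i][0]), int(Z[i][1])
        new_id = n + i
        parent[find(left)] = new_id
        parent[find(right)] = new_id

    # Relabel the remaining roots as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(p), len(roots)) for p in range(n)]


# Toy hierarchy over 5 points: {0, 1, 2} merge early, {3, 4} merge early,
# and the two groups only join at a much larger distance.
Z_toy = [
    (0, 1, 0.1, 2),   # creates node 5
    (3, 4, 0.2, 2),   # creates node 6
    (2, 5, 0.3, 3),   # creates node 7
    (6, 7, 2.0, 5),   # creates node 8 (the root)
]
print(cut_linkage(Z_toy, 2))  # [0, 0, 0, 1, 1]
print(cut_linkage(Z_toy, 3))  # [0, 0, 1, 2, 2]
```

Note that a cut like this assigns every point to some cluster, so there is no noise label (-1) as in HDBSCAN's own `labels_`. In practice, `scipy.cluster.hierarchy.fcluster` with `criterion='maxclust'` does the same job on the real tree.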
Upvotes: 2