Fine-tuning UMAP parameters for clustering using HDBSCAN relative_validity (DBCV) scores

Question

I am using UMAP and HDBSCAN to cluster embedded text data, following the approach described in https://towardsdatascience.com/clustering-sentence-embeddings-to-identify-intents-in-short-text-48d22d3bf02e. The data is split into multiple groups, and I want to identify clusters within each group. One challenge is choosing parameters for UMAP and HDBSCAN, since I expect the best parameters to differ from group to group.

I am considering using DBCV scores to find the ideal parameters. From what I understand, relative_validity_ from HDBSCAN (a fast approximation of the DBCV score) can be used to choose parameters for HDBSCAN, but I have not read anything about using this score to choose parameters for a precursor step like UMAP. What concerns, if any, are there with using the relative_validity_ value from the HDBSCAN clusterer to tune the parameters for UMAP? Is this a valid approach?
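For concreteness, here is a minimal sketch of the kind of search I have in mind. The function names (umap_hdbscan_score, grid_search), the fixed min_dist and metric, and the parameter grid are illustrative assumptions, not a recommended setup. Note that relative_validity_ is computed on the data passed to HDBSCAN, which here is the UMAP embedding, so each candidate is scored in its own reduced space.

```python
from itertools import product


def umap_hdbscan_score(X, n_neighbors, n_components, min_cluster_size, random_state=42):
    """Reduce X with UMAP, cluster with HDBSCAN, return (relative_validity_, labels)."""
    # Imported lazily so grid_search below can also be used with another scorer.
    import umap
    import hdbscan

    embedding = umap.UMAP(
        n_neighbors=n_neighbors,
        n_components=n_components,
        min_dist=0.0,        # assumption: fixed here, could also be searched
        metric="cosine",     # assumption: common choice for text embeddings
        random_state=random_state,
    ).fit_transform(X)

    clusterer = hdbscan.HDBSCAN(
        min_cluster_size=min_cluster_size,
        gen_min_span_tree=True,  # needed for relative_validity_
    ).fit(embedding)

    # The score is computed in the embedded space, not on the original vectors.
    return float(clusterer.relative_validity_), clusterer.labels_


def grid_search(X, param_grid, score_fn=umap_hdbscan_score):
    """Score every parameter combination and return results sorted best first."""
    names = list(param_grid)
    results = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score, labels = score_fn(X, **params)
        results.append({"params": params, "score": score, "labels": labels})
    return sorted(results, key=lambda r: r["score"], reverse=True)
```

This would be run once per group, for example grid_search(X_group, {"n_neighbors": [5, 15, 30], "n_components": [5, 10], "min_cluster_size": [10, 20]}), where X_group is a hypothetical array of embeddings for one group.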

Edge cases to consider: Some groups may have no clusters.

I have tried implementing this and I am getting mixed results. It is difficult to check because I do not have labels to serve as ground-truth for the clusters, so I have to check manually to see if the results make sense. Will keep investigating, but curious to hear thoughts from others on this approach.

I also tested an edge case where I would not expect any clusters. When HDBSCAN predicts no clusters, the relative_validity_ value is 0, as expected. When it does predict clusters, the relative_validity_ value is greater than 0 but small (less than 0.01). Maybe a threshold can be set to decide when a proposed clustering structure is valid.
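The threshold idea could look something like the sketch below. The function name has_valid_clusters is hypothetical, and the default cutoff of 0.01 is only a placeholder taken from the values I observed in this one test, not an established rule.

```python
def has_valid_clusters(labels, relative_validity, min_validity=0.01):
    """Return True if at least one cluster was found and the score clears the cutoff.

    labels: HDBSCAN labels_, where -1 marks noise points.
    relative_validity: the clusterer's relative_validity_ value.
    min_validity: assumed cutoff; would need tuning per dataset.
    """
    n_clusters = len({int(label) for label in labels if int(label) != -1})
    return n_clusters > 0 and relative_validity >= min_validity
```

A group whose best-scoring parameter set still fails this check would then be treated as having no clusters.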

