Guillon

Reputation: 5

Low silhouette scores in mixed data clustering: Impact of categorical variables and possible solutions?

I am clustering a dataset with both numerical and categorical variables. To handle the high dimensionality, I reduced the dimensionality of each variable type separately, retaining 21 factors that together explain at least 85% of the variance. The categorical variables were dummy coded. I then applied K-Means, hierarchical clustering, and GMM, tuning the models with different distance metrics. Although the resulting cluster distributions look sensible, the silhouette scores stay consistently low (below 0.07). Could this be due to the binary nature of the categorical variables after dummy coding? How can I improve clustering performance in this scenario? Any insights or alternative approaches would be greatly appreciated! (For reference, I also tried K-Prototypes without dimensionality reduction, but the results were similar.)

AND THE SOLUTION: (I couldn't post it below as an answer)

For the community's information, we found the solution by applying the Factorial Ordination and Mixed Data (FOMD) approach, which reduces dimensionality for the numerical and categorical data simultaneously. This gave better results than our initial attempt, where we reduced dimensionality separately for the numeric and categorical variables. Hope this can help others as well! ^^
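
For anyone who wants to try a joint reduction, here is a minimal sketch. It assumes the approach corresponds to what is usually called Factor Analysis of Mixed Data (FAMD): numeric columns are standardized, each category indicator is divided by the square root of its frequency and centered, and a single SVD is run on the combined matrix. The helper name `famd_coordinates` and the toy data are hypothetical, purely for illustration; this is not the exact code we ran.

```python
import numpy as np


def famd_coordinates(numeric, categorical, n_components=2):
    """FAMD-style joint reduction (hypothetical helper, not a library API).

    numeric: 2-D array of shape (n_samples, n_numeric_features).
    categorical: list of label sequences, one per categorical variable,
        each of length n_samples.
    Returns the sample coordinates and the explained-variance ratios
    of the retained components.
    """
    numeric = np.asarray(numeric, dtype=float)

    # Standardize numeric columns to zero mean and unit variance.
    std = numeric.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns at zero after centering
    blocks = [(numeric - numeric.mean(axis=0)) / std]

    for column in categorical:
        labels = np.asarray(column)
        levels = np.unique(labels)
        # One indicator column per observed level.
        indicators = (labels[:, None] == levels[None, :]).astype(float)
        proportions = indicators.mean(axis=0)
        # Divide by sqrt(frequency) so rare levels are not drowned out,
        # then center each indicator column.
        scaled = indicators / np.sqrt(proportions)
        blocks.append(scaled - scaled.mean(axis=0))

    # A single SVD on the combined matrix reduces both variable types at once.
    z = np.hstack(blocks)
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    coords = u[:, :n_components] * s[:n_components]
    return coords, explained[:n_components]


# Toy data (assumed for illustration): two numeric and two categorical variables.
numeric = np.array([
    [1.0, 10.0], [1.2, 11.0], [0.9, 9.5],
    [5.0, 30.0], [5.2, 31.0], [4.8, 29.0],
])
categorical = [
    ["a", "a", "a", "b", "b", "b"],
    ["x", "y", "x", "y", "y", "x"],
]
coords, explained = famd_coordinates(numeric, categorical, n_components=2)
print(coords.shape, explained)
```

The resulting `coords` can then be passed to K-Means, hierarchical clustering, or GMM in place of the separately reduced factors, and the silhouette score recomputed on those coordinates.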

Upvotes: 0

Views: 40

Answers (0)
