Which clustering technique and distance measure should I choose for a dataset with binary and continuous variables that have different distributions?

Question

I am performing clustering in Python to find groups in a dataset.

I have already handled missing values, understood my distributions, dealt with outliers, transformed categorical features into binaries with get_dummies, and normalized other columns between 0 and 1.

My question is about choosing the clustering technique and the most suitable distance measure. During my research, I found that some distance measures are intended for continuous quantitative variables and others for binary ones. In addition, my columns have different distributions.

I saw that there is a method that applies a different measure depending on the type of each variable, but I would like to know whether there are other approaches, or whether you can point me in the right direction.
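
To make the per-type method concrete, here is a minimal plain-Python sketch of how I understand it. I believe this is usually called Gower distance, but that naming is my assumption. The toy rows, the `is_binary` flags, and the function names are made up for illustration and are not my real data. Treating each dummy column as its own binary feature is also my own simplification.

```python
def gower_distance(row_a, row_b, is_binary, ranges):
    """Average per-feature dissimilarity between two rows (Gower-style)."""
    total = 0.0
    for a, b, binary, rng in zip(row_a, row_b, is_binary, ranges):
        if binary:
            # Binary feature: 0 if the two values agree, 1 otherwise.
            total += 0.0 if a == b else 1.0
        else:
            # Continuous feature: absolute difference divided by the column range.
            total += abs(a - b) / rng if rng > 0 else 0.0
    return total / len(row_a)


def gower_matrix(rows, is_binary):
    """Symmetric pairwise distance matrix as a list of lists."""
    n_features = len(is_binary)
    # Range of each column, used to normalise the continuous differences.
    ranges = [
        max(r[j] for r in rows) - min(r[j] for r in rows)
        for j in range(n_features)
    ]
    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(i + 1, n):
            d = gower_distance(rows[i], rows[k], is_binary, ranges)
            dist[i][k] = dist[k][i] = d
    return dist


# Hypothetical toy data: two scaled continuous columns, then two dummy columns.
rows = [
    [0.0, 0.1, 1, 0],
    [0.1, 0.0, 1, 0],
    [1.0, 0.9, 0, 1],
]
is_binary = [False, False, True, True]
dist = gower_matrix(rows, is_binary)
```

As far as I can tell, a matrix like `dist` could then be passed to any algorithm that accepts precomputed distances (for example hierarchical clustering), rather than to k-means, which works on the raw feature vectors.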

After the preprocessing described above, I expected to understand how those steps influence clustering and to be able to determine the appropriate technique and distance measure. However, I am still unsure which clustering method and distance measure to apply, given the mix of binary and continuous variables and their different distributions.

Answers (0)
