I am performing clustering in Python to find groups in a dataset.
I have already handled missing values, examined the distributions, dealt with outliers, converted categorical features into binary columns with get_dummies, and scaled the remaining numeric columns to the range 0 to 1.
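For context, here is a minimal sketch of the encoding and scaling steps. The toy data and the column names (`age`, `income`, `city`) are made-up placeholders for illustration, not my real dataset, and I am assuming plain min-max scaling:

```python
import pandas as pd

# Hypothetical toy data; column names are placeholders, not the real dataset.
df = pd.DataFrame({
    "age": [23.0, 45.0, 31.0, 60.0],
    "income": [20000.0, 52000.0, 38000.0, 75000.0],
    "city": ["A", "B", "A", "C"],
})

numeric_cols = ["age", "income"]

# One-hot encode the categorical column into 0/1 indicator columns.
encoded = pd.get_dummies(df, columns=["city"], dtype=int)

# Min-max scale the numeric columns so each lies in [0, 1].
col_min = encoded[numeric_cols].min()
col_range = encoded[numeric_cols].max() - col_min
encoded[numeric_cols] = (encoded[numeric_cols] - col_min) / col_range

print(encoded)
```

After this step every column lies in [0, 1], but the dummy columns only take the values 0 and 1 while the scaled columns are continuous, which is where my doubt about the distance measure comes from.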
My question is about choosing the clustering technique and the distance measure. During my research I found that some distance measures are meant for continuous quantitative variables and others for binary ones, and my dataset contains both. The columns also have different distributions.
I saw that there is a method that switches the distance measure according to the type of each variable, but I would like to know whether there are other approaches, or whether you can point me in the right direction.
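To make clear what I mean by a per-type measure, here is a rough sketch of my understanding. I believe this idea is usually called Gower distance, but that name and the exact formula are my assumptions: a range-scaled absolute difference for numeric features, a simple 0/1 mismatch for binary features, averaged with equal weight. The function name `gower_like_distance` and the sample arrays are hypothetical:

```python
import numpy as np

def gower_like_distance(numeric, binary):
    """Pairwise distances: range-scaled L1 for numeric, mismatch for binary."""
    numeric = np.asarray(numeric, dtype=float)
    binary = np.asarray(binary)

    # Range-normalised absolute difference per numeric feature.
    ranges = numeric.max(axis=0) - numeric.min(axis=0)
    ranges[ranges == 0] = 1.0  # constant columns then contribute zero distance
    num_part = np.abs(numeric[:, None, :] - numeric[None, :, :]) / ranges

    # 0 when two rows agree on a binary feature, 1 when they differ.
    bin_part = (binary[:, None, :] != binary[None, :, :]).astype(float)

    # Average over all features so each feature gets equal weight.
    total = num_part.sum(axis=2) + bin_part.sum(axis=2)
    return total / (numeric.shape[1] + binary.shape[1])

# Made-up sample: two scaled numeric features and two dummy columns.
numeric = np.array([[0.0, 0.1], [1.0, 0.9], [0.5, 0.5]])
binary = np.array([[1, 0], [0, 1], [1, 0]])

D = gower_like_distance(numeric, binary)
print(D)
```

As far as I understand, a matrix like `D` could then be passed to a clustering algorithm that accepts precomputed distances (for example DBSCAN in scikit-learn with `metric="precomputed"`). One thing I am unsure about is that, with dummy columns, two rows that differ in a single categorical feature mismatch on two binary columns, so that feature may weigh more than a numeric one.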
What I tried: the preprocessing described above (handling missing values, examining distributions, dealing with outliers, encoding categorical features with get_dummies, and scaling numeric columns to the range 0 to 1). What I expected: to understand how these preprocessing steps influence the clustering and to determine an appropriate technique and distance measure. What I am still unsure about: which clustering method and distance measure to apply, given the mix of binary and continuous variables and their different distributions.