Beg

Reputation: 405

Choosing a clustering algorithm for a dataset containing both categorical and numerical variables

I am new to machine learning and am trying to do a segmentation with clustering algorithms. However, since my dataset has both categorical variables (such as gender, marital status, preferred social media platform, etc.) and numerical variables (average expenditure, age, income, etc.), I cannot decide which algorithms are worth focusing on. Which should I try in order to compare with k-means++: fuzzy c-means, k-medoids, or latent class analysis? Which would yield better results for this type of mixed dataset?

Bonus question: Should I cluster without dimensionality reduction, or should I use PCA or K-PCA in any case to reduce the number of dimensions? Also, how can I understand and interpret the results without visualization if the dataset has more than 3 dimensions?

Upvotes: -1

Views: 530

Answers (2)

Has QUIT--Anony-Mousse

Reputation: 77454

The best thing to try is hierarchical agglomerative clustering with a distance metric such as Gower's.

Mixed data with different scales usually does not work in any statistically meaningful way. You have too many weights to choose, so no result will be statistically well founded; it will largely be a product of your weighting. It is therefore impossible to argue that any result is the "true" clustering, so don't expect the results to be very good.
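
To make the suggestion concrete, here is a minimal sketch of Gower's distance followed by agglomerative clustering. It is only an illustration, not a reference implementation: the `gower_matrix` helper, the toy `customers` table, the equal per-column weights, and the choice of average linkage with two clusters are all assumptions made for the example. The equal weights are exactly the kind of choice the paragraph above warns about.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def gower_matrix(df, categorical_cols):
    """Pairwise Gower distances, giving every column the same weight (an assumption)."""
    n = len(df)
    total = np.zeros((n, n))
    for col in df.columns:
        values = df[col].to_numpy()
        if col in categorical_cols:
            # Categorical column: distance 0 if the two rows match, 1 otherwise.
            total += (values[:, None] != values[None, :]).astype(float)
        else:
            # Numerical column: absolute difference divided by the column range.
            values = values.astype(float)
            col_range = values.max() - values.min()
            if col_range > 0:
                total += np.abs(values[:, None] - values[None, :]) / col_range
    # Average the per-column distances so the result lies in [0, 1].
    return total / df.shape[1]


# Made-up toy data, for illustration only.
customers = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M"],
    "platform": ["twitter", "facebook", "twitter", "facebook", "twitter", "facebook"],
    "age": [23, 45, 25, 50, 22, 47],
    "income": [30000, 80000, 32000, 90000, 28000, 85000],
})

distances = gower_matrix(customers, categorical_cols={"gender", "platform"})

# linkage() expects a condensed distance vector; average linkage
# accepts arbitrary (non-Euclidean) dissimilarities.
tree = linkage(squareform(distances, checks=False), method="average")
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)
```

Changing the per-column weights or the linkage method can change the clusters, so any result should be read as one possible grouping rather than the "true" one.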

Upvotes: 1

bits

Reputation: 1705

Generally, when you have categorical data, you encode it as numerical values. In your case, consider social media: twitter, facebook, google-plus. You might be tempted to encode them as twitter: 0, facebook: 1, google-plus: 2. But this encoding has a problem: it implies to the machine learning algorithm that the categories are ordered and that google-plus is twice facebook, which is not what you want.

Enter one-hot encoding: it converts categorical data into a vector of bits, with the number of bits equal to the number of categories present in your data:

social media  |  binary vector (bits in order: is_twitter, is_facebook, is_google_plus)
twitter       |  1, 0, 0
facebook      |  0, 1, 0
google-plus   |  0, 0, 1
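
The table above can be reproduced with a short sketch using pandas `get_dummies`. The `customers` table and its column names are made up for the example, and the `is` prefix is chosen only to mirror the bit names in the table.

```python
import pandas as pd

# Made-up toy data, for illustration only.
customers = pd.DataFrame({
    "social_media": ["twitter", "facebook", "google-plus", "facebook"],
    "age": [23, 45, 31, 52],
})

# Replace the categorical column with one 0/1 indicator column per category.
encoded = pd.get_dummies(customers, columns=["social_media"], prefix="is", dtype=int)
print(encoded)
```

Each row ends up with exactly one indicator set to 1. Note that `age` stays on its original scale here, so it would likely dominate a Euclidean distance unless the numerical columns are rescaled first.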

Now you can apply any ML algorithm, since all of your data is numerical.

More here: One hot encoding in scikit

Upvotes: 0
