Reputation: 405
I am new to machine learning and am trying to do a segmentation with clustering algorithms. However, since my dataset has both categorical variables (such as gender, marital status, preferred social media platform, etc.) and numerical variables (average expenditure, age, income, etc.), I cannot decide which algorithms are worth focusing on. Which should I try to compare with k-means++: fuzzy c-means, k-medoids, or latent class analysis? Which of them would yield better results for this type of mixed dataset?
Bonus question: Should I try clustering without dimensionality reduction, or should I use PCA or kernel PCA in any case to reduce the number of dimensions? Also, how can I understand and interpret the results without visualization if the dataset has more than 3 dimensions?
Upvotes: -1
Views: 530
Reputation: 77454
The best thing to try is hierarchical agglomerative clustering with a distance metric such as Gower's.
Clustering mixed data with different scales usually does not work in any statistically meaningful way. You have too many weights to choose, so no result will be statistically well founded; it will largely be a product of your weighting. It is therefore impossible to argue that any particular result is the "true" clustering, so don't expect the results to be very good.
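To make the suggestion concrete, here is a minimal pure-Python sketch of Gower's distance followed by average-linkage agglomerative clustering. The toy records, the column layout, the equal weight per variable, and the choice of average linkage are all assumptions made for illustration; they are not part of the question's data, and changing the weights will change the clusters, which is exactly the caveat above.

```python
from itertools import combinations

# Hypothetical toy records: (age, income, gender, platform)
records = [
    (25, 30000, "F", "twitter"),
    (27, 32000, "F", "twitter"),
    (52, 90000, "M", "facebook"),
    (55, 95000, "M", "facebook"),
]
NUMERIC = [0, 1]      # column indexes treated as numerical
CATEGORICAL = [2, 3]  # column indexes treated as categorical


def gower_matrix(rows, numeric, categorical):
    """Pairwise Gower distances, with equal weight for every variable."""
    ranges = {}
    for j in numeric:
        values = [r[j] for r in rows]
        ranges[j] = (max(values) - min(values)) or 1.0  # avoid dividing by zero
    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for a, b in combinations(range(n), 2):
        total = 0.0
        for j in numeric:
            # Range-normalised absolute difference, always in [0, 1]
            total += abs(rows[a][j] - rows[b][j]) / ranges[j]
        for j in categorical:
            # Simple matching: 0 if equal, 1 otherwise
            total += 0.0 if rows[a][j] == rows[b][j] else 1.0
        dist[a][b] = dist[b][a] = total / (len(numeric) + len(categorical))
    return dist


def average_link(dist, c1, c2):
    """Mean pairwise distance between two clusters of row indexes."""
    return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))


def agglomerate(dist, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_link(dist, clusters[p[0]], clusters[p[1]]),
        )
        clusters[x].extend(clusters[y])  # x < y, so deleting y is safe
        del clusters[y]
    return [sorted(c) for c in clusters]


D = gower_matrix(records, NUMERIC, CATEGORICAL)
clusters = agglomerate(D, n_clusters=2)
print(clusters)
```

This naive loop is only meant to show the mechanics on a handful of rows; for a real dataset you would hand the precomputed distance matrix to a library implementation of hierarchical clustering instead.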
Upvotes: 1
Reputation: 1705
Generally, when you have categorical data, you encode it as "numerical" values. In your case, consider the social media variable: twitter, facebook, google-plus. You might be tempted to encode these as twitter: 0, facebook: 1, google-plus: 2. But this encoding has a problem: it tells the machine learning algorithm that the categories are ordered and that google-plus is "twice" facebook, which is not what you want.
Enter one-hot encoding: it converts a categorical value into a vector of bits, with one bit for each category present in your data:
social media | is_twitter, is_facebook, is_google_plus
twitter | 1, 0, 0
facebook | 0, 1, 0
google-plus | 0, 0, 1
Now you can apply any ML algorithm that expects numerical input, since all of your data is numerical.
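As a rough illustration, the sketch below builds the one-hot vectors by hand and compares the distances they imply with those of the integer codes. The helper names `one_hot` and `sq_dist` and the sample list `platforms` are made up for this example; in practice you would more likely reach for scikit-learn's `OneHotEncoder` or `pandas.get_dummies`.

```python
def one_hot(values, categories):
    """Return one bit vector per value, with bits ordered as in `categories`."""
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # set only the bit for this value's category
        encoded.append(row)
    return encoded


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


categories = ["twitter", "facebook", "google-plus"]
platforms = ["twitter", "facebook", "google-plus", "facebook"]  # hypothetical column

encoded = one_hot(platforms, categories)
for name, bits in zip(platforms, encoded):
    print(name, bits)

# Integer codes make twitter/google-plus look farther apart than twitter/facebook
codes = {"twitter": 0, "facebook": 1, "google-plus": 2}
int_tw_fb = abs(codes["twitter"] - codes["facebook"])
int_tw_gp = abs(codes["twitter"] - codes["google-plus"])

# With one-hot vectors, every pair of distinct categories is equally far apart
hot_tw_fb = sq_dist(encoded[0], encoded[1])
hot_tw_gp = sq_dist(encoded[0], encoded[2])
print(int_tw_fb, int_tw_gp, hot_tw_fb, hot_tw_gp)
```

The last line shows the point of the encoding: the integer codes give unequal gaps between categories, while the one-hot vectors treat all distinct categories as equidistant.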
More here: One hot encoding in scikit
Upvotes: 0