Reputation: 19
Why can't we use Euclidean distance for clustering categorical variables, and why do we use Gower distance instead? I am just looking for the simple logic behind this and for how the two distances differ when clustering categorical variables.
I searched Google for this but could not find anything concrete and logical.
Upvotes: 2
Views: 1082
Reputation: 4264
Euclidean distance can be used if your categorical data is ordinal in nature: if you encode the levels reasonably, the resulting distances are meaningful. For example, assume you are dealing with the results of a survey conducted on a Likert scale with the levels Very Good, Good, Neutral, Bad and Very Bad. If you encode them as 5, 4, 3, 2 and 1 respectively and compute the distance between any pair of them, the result makes sense (the distance between Bad and Very Good is 3, which is meaningful).
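As a minimal sketch of the Likert example above (the names `likert` and `ordinal_distance` are made up for illustration), the distance in one dimension is just the absolute difference of the codes:

```python
# Ordinal encoding from the example: the numeric order mirrors the real order.
likert = {"Very Bad": 1, "Bad": 2, "Neutral": 3, "Good": 4, "Very Good": 5}

def ordinal_distance(a, b):
    """Euclidean distance in one dimension between two encoded levels."""
    return abs(likert[a] - likert[b])

print(ordinal_distance("Bad", "Very Good"))   # 3
print(ordinal_distance("Good", "Very Good"))  # 1
```

Levels that are closer in meaning get a smaller distance, which is what a clustering algorithm needs. This still assumes the steps between levels are equally spaced.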
On the other hand, if your variables are categorical but nominal in nature, with no inherent ordering, computing Euclidean distances on the codes doesn't make sense. For example, assume your feature is color, taking the values Red, Blue, Green and Pink, and you encode them as 4, 3, 2 and 1 respectively. Even if you compute the distance between Green and Red and report it as 2, the number means nothing: you can't say that Red differs from Green by 2 units. The codes are arbitrary labels, so a different but equally valid assignment would give a different distance.
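To see why the number is arbitrary, here is a small sketch that compares the encoding from the example with a second, equally valid relabeling (`encoding_b` is a hypothetical alternative, not something from the question):

```python
# Encoding from the example above.
encoding_a = {"Red": 4, "Blue": 3, "Green": 2, "Pink": 1}
# Hypothetical alternative: the same colors, codes assigned in another order.
encoding_b = {"Red": 1, "Blue": 4, "Green": 2, "Pink": 3}

def coded_distance(encoding, a, b):
    """Distance between two categories under a given integer encoding."""
    return abs(encoding[a] - encoding[b])

dist_a = coded_distance(encoding_a, "Green", "Red")
dist_b = coded_distance(encoding_b, "Green", "Red")
print(dist_a, dist_b)  # 2 1
```

The same pair of colors ends up 2 units apart under one labeling and 1 unit apart under the other, so any clusters built on these distances would reflect the labeling rather than the data.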
For nominal variables you could use Hamming distance, or Gower distance (implementations exist in R) if you have mixed data.
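A rough sketch of both ideas in plain Python follows. This is a simplified version of Gower distance under stated assumptions: equal weights, no missing values, nominal features scored 0 if equal and 1 otherwise, and numeric features scored as the absolute difference divided by the feature's range. The function names and the `numeric_ranges` parameter are invented for this illustration and are not any library's API.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same length")
    return sum(x != y for x, y in zip(a, b))

def gower_distance(a, b, numeric_ranges):
    """Simplified Gower distance between two mixed-type records.

    numeric_ranges maps the index of each numeric feature to its range
    (max - min over the dataset); all other features are treated as nominal.
    """
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in numeric_ranges:
            r = numeric_ranges[i]
            # Range-normalised difference, so each feature contributes 0..1.
            total += abs(x - y) / r if r else 0.0
        else:
            # Nominal feature: simple match / mismatch.
            total += 0.0 if x == y else 1.0
    return total / len(a)

# Purely nominal records: only equality matters, not the labels themselves.
h = hamming_distance(("Red", "Good"), ("Green", "Good"))

# Mixed records: (color, rating, age), with age assumed to span a range of 40.
g = gower_distance(("Red", "Good", 30), ("Green", "Good", 50), {2: 40})
print(h, g)  # 1 0.5
```

Both measures only ask whether two nominal values match, so the result does not depend on how the categories happen to be coded. Gower additionally scales each numeric feature to the 0 to 1 range, so mixed columns can be averaged together.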
Upvotes: 3