Reputation: 11
I am clustering categorical data. I came across the k-modes algorithm and found it to be a good fit for my requirements. Now I want to measure the dissimilarity within each cluster and reduce it as much as possible. Is there any way to do that? Alternatively, is there any way to check how well my data has been clustered? Since my data is categorical, methods that rely on a numeric distance metric might not be helpful.
Upvotes: 1
Views: 1912
Reputation: 2585
To measure the dissimilarity within a cluster, you first need to define a dissimilarity measure. For categorical data, one possible way of calculating dissimilarity is the following:
d(i, j) = (p - m) / p
where:
p is the number of categorical features in your data
m is the number of features on which samples i and j have the same value
For example, if your data has 3 categorical features and the samples i and j are as follows:
   Feature1  Feature2  Feature3
i  x         y         z
j  x         w         z
So here we have 3 categorical features, so p = 3, and two of these three features have the same value for samples i and j, so m = 2. Therefore:
d(i, j) = (3 - 2) / 3
d(i, j) ≈ 0.33
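As a minimal sketch of the formula above in plain Python (the function name `matching_dissimilarity` is made up for illustration and is not part of any k-modes library):

```python
def matching_dissimilarity(a, b):
    """Return (p - m) / p for two categorical samples of equal length."""
    if len(a) != len(b):
        raise ValueError("samples must have the same number of features")
    if not a:
        raise ValueError("samples must have at least one feature")
    p = len(a)                               # number of categorical features
    m = sum(x == y for x, y in zip(a, b))    # number of matching features
    return (p - m) / p


# The two samples from the example above.
i = ["x", "y", "z"]
j = ["x", "w", "z"]
d_ij = matching_dissimilarity(i, j)
print(round(d_ij, 2))  # 0.33
```

The result is 0 when the two samples agree on every feature and 1 when they agree on none.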
Another alternative is to convert your categorical variables to one-hot-encoded features and then compute the Jaccard similarity between samples (the corresponding dissimilarity is 1 minus the similarity).
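Here is a rough sketch of the one-hot plus Jaccard idea without any external library. It relies on the observation that a one-hot-encoded sample is equivalent to the set of its active (feature index, value) pairs. The helper names are made up for illustration:

```python
def one_hot_set(sample):
    """Represent a sample as the set of its active one-hot columns.

    Each (feature index, value) pair corresponds to exactly one column
    that would be 1 in a one-hot encoding of the sample.
    """
    return {(k, v) for k, v in enumerate(sample)}


def jaccard_dissimilarity(a, b):
    """Return 1 - |A & B| / |A | B| on the one-hot sets of two samples."""
    sa, sb = one_hot_set(a), one_hot_set(b)
    union = sa | sb
    if not union:
        return 0.0  # two empty samples are treated as identical
    return 1 - len(sa & sb) / len(union)


# Same samples as in the example above: 2 shared columns out of 4 active ones.
d_jaccard = jaccard_dissimilarity(["x", "y", "z"], ["x", "w", "z"])
print(d_jaccard)  # 0.5
```

Note that this gives 0.5 rather than 0.33 for the same pair of samples, because a mismatch on one feature adds two distinct one-hot columns to the union. The two measures rank pairs in the same order but are not numerically identical.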
So, to measure the dissimilarity within a cluster, you could calculate the pairwise dissimilarity between every pair of objects in the cluster and then take the average.
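The averaging step could look roughly like this. It is a sketch that assumes the simple matching dissimilarity (the fraction of features on which two samples differ) and a small made-up cluster; `within_cluster_dissimilarity` is a hypothetical helper name:

```python
from itertools import combinations


def matching_dissimilarity(a, b):
    """Fraction of features on which two categorical samples differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def within_cluster_dissimilarity(cluster, dissimilarity=matching_dissimilarity):
    """Average dissimilarity over all unordered pairs of samples in a cluster."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0  # a cluster with fewer than two samples has no pairs
    return sum(dissimilarity(a, b) for a, b in pairs) / len(pairs)


# Made-up cluster of three samples with three categorical features each.
cluster = [
    ["x", "y", "z"],
    ["x", "w", "z"],
    ["x", "y", "q"],
]
avg = within_cluster_dissimilarity(cluster)
print(round(avg, 3))  # 0.444
```

Running this once per cluster (grouping samples by the labels k-modes assigns) gives one number per cluster that can be compared across runs. The number of pairs grows quadratically with cluster size, so for large clusters you may want to sample pairs instead.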
Based on these measures, you can also use the silhouette score to evaluate the quality of your clustering. Take it with a grain of salt, though: the score can be good even when the clustering is not what you expected.
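For illustration, here is a small pure-Python sketch of the silhouette score computed from a categorical dissimilarity. It assumes the simple matching dissimilarity and made-up data and labels, and it is not the k-modes library's own API. If you use scikit-learn, `sklearn.metrics.silhouette_score` with `metric="precomputed"` accepts a pairwise dissimilarity matrix and should serve the same purpose.

```python
def matching_dissimilarity(a, b):
    """Fraction of features on which two categorical samples differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def silhouette(samples, labels, dissimilarity=matching_dissimilarity):
    """Mean silhouette coefficient over all samples."""
    # Group sample indices by cluster label.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for idx, lab in enumerate(labels):
        own = [k for k in clusters[lab] if k != idx]
        if not own:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean dissimilarity to the other members of the same cluster.
        a = sum(dissimilarity(samples[idx], samples[k]) for k in own) / len(own)
        # b: smallest mean dissimilarity to the members of any other cluster.
        b = min(
            sum(dissimilarity(samples[idx], samples[k]) for k in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Made-up data: two well-separated clusters of two samples each.
samples = [["x", "y", "z"], ["x", "y", "w"], ["a", "b", "c"], ["a", "b", "d"]]
labels = [0, 0, 1, 1]
score = silhouette(samples, labels)
print(round(score, 3))  # 0.667
```

Values close to 1 suggest compact, well-separated clusters, values near 0 suggest overlapping clusters, and negative values suggest samples that sit closer to another cluster than to their own.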
Upvotes: 1