Reputation: 18830
I am implementing a few algorithms for cluster analysis, in particular for cluster validation. There are several approaches, such as cross-validation, external indices, internal indices, and relative indices. I am trying to implement an algorithm that falls under internal indices.
Internal index: based on the intrinsic content of the data. It measures the goodness of a clustering structure without reference to external information. I am interested in the Silhouette Coefficient:
s(i) = (b(i) - a(i)) / max{a(i), b(i)}
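For reference, the standard definitions behind this formula can be written out as below. The symbols C_i (the cluster containing point i) and d (the chosen distance) are notation introduced here for illustration, not taken from the post.

```latex
a(i) = \frac{1}{|C_i| - 1} \sum_{j \in C_i,\; j \neq i} d(i, j)
\qquad
b(i) = \min_{C \neq C_i} \; \frac{1}{|C|} \sum_{j \in C} d(i, j)
\qquad
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}
```

Here a(i) is the mean distance from i to the other members of its own cluster and b(i) is the smallest mean distance from i to any other cluster, so s(i) lies in [-1, 1]. By convention, s(i) = 0 when i is the only member of its cluster.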
To make this clearer, let's assume I have the following multimodal distribution:
library(mixtools)
wait = faithful$waiting
mixmdl = normalmixEM(wait)
plot(mixmdl,which=2)
lines(density(wait), lty=2, lwd=2)
We can see that there are two clusters, with the cutoff around 68. There is no labeled data here, so there is no ground truth for cross-validation (the setting is unsupervised), and we need some other mechanism to evaluate the clusters. In this case we know from the visualization that there are two clusters, but how do we clearly show that the two distributions actually correspond to separate clusters? Based on what I read on Wikipedia, the silhouette gives us that validation.
I want to implement a method (which implements Silhouette) that takes an R list of values (in my example, wait), the number of clusters (in this case 2), and the fitted model, and returns the average s(i).
I have started, but I can't really figure out how to go forward:
Silhouette = function(rList, num_clusters, model) {
}
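A minimal sketch of how such a function could look, assuming the clusters are passed as a list of numeric vectors (one vector per cluster) and that distance is the absolute difference between 1-D values. The name `silhouette_mean` and its single-argument signature are hypothetical, not from the original post, and the loop is O(n^2), so it is only practical for small inputs.

```r
# Hypothetical helper: average silhouette width for 1-D data stored as a
# list of numeric vectors, one vector per cluster.
silhouette_mean <- function(clusters) {
  stopifnot(is.list(clusters), sum(lengths(clusters) > 0) >= 2)

  # Flatten the list into parallel vectors of values and cluster labels.
  values <- unlist(clusters, use.names = FALSE)
  labels <- rep(seq_along(clusters), lengths(clusters))
  n <- length(values)
  s <- numeric(n)

  for (i in seq_len(n)) {
    d <- abs(values[i] - values)      # distance from point i to every point
    own <- labels == labels[i]
    n_own <- sum(own)
    if (n_own == 1) {                 # singleton cluster: s(i) = 0 by convention
      s[i] <- 0
      next
    }
    a <- sum(d[own]) / (n_own - 1)    # mean distance to own cluster, self excluded
    b <- min(tapply(d[!own], labels[!own], mean))  # nearest other cluster
    denom <- max(a, b)
    s[i] <- if (denom == 0) 0 else (b - a) / denom
  }
  mean(s)
}

# Example call on toy data (values are illustrative only).
silhouette_mean(list(clust_A = c(1, 2), clust_B = c(10, 11)))
```

For the faithful example, hard cluster labels could probably be derived from the fitted mixture first, for instance with `split(wait, apply(mixmdl$posterior, 1, which.max))`, assuming the `posterior` matrix that `normalmixEM` returns; the resulting list can then be passed to the function above.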
The summary of my list looks like this:
        Length Class  Mode
clust_A 416014 -none- numeric
clust_B  72737 -none- numeric
clust_C   6078 -none- numeric
myList$clust_A
returns the points that belong to that cluster:
[1] 13 880 497 1864 392 55 1130 248 437 37 62 153 60 117
[15] 22 106 71 1026 446 1558 23 56 287 402 46 1506 115 2700
[29] 67 134 48 536 41 506 1098 33 30 280 225 16 25 17
[43] 63 1762 477 174 98 76 157 698 47 312 40 3 198 621
[57] 15 34 226 657 48 110 23 250 14 32 137 272 26 257
[71] 270 133 1734 78 134 8 5 225 187 166 35 15 94 2825
[85] 2 8 94 89 54 91 77 17 106 1397 16 25 16 103
The problem is that I don't think the existing libraries accept this type of data structure.
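One way the list could be reshaped for an existing implementation is sketched below, assuming the `cluster` package and its `silhouette(x, dist)` interface, which takes an integer vector of cluster codes plus a dissimilarity object. The toy `myList`, the `max_points` cap, and the random subsample are illustrative assumptions; with several hundred thousand points a full `dist()` matrix would not fit in memory, so the subsampled result is only an approximation, and a plain random sample may under-represent small clusters.

```r
library(cluster)  # provides silhouette()

# Toy stand-in for the real list; names and values are illustrative only.
myList <- list(clust_A = c(1, 2), clust_B = c(10, 11))

# Flatten the list into the two parallel vectors silhouette() works with.
values <- unlist(myList, use.names = FALSE)
labels <- rep(seq_along(myList), lengths(myList))

# dist() needs O(n^2) memory, so cap the number of points (assumed limit).
max_points <- 5000
if (length(values) > max_points) {
  set.seed(1)                                  # reproducible subsample
  keep <- sample(length(values), max_points)
  values <- values[keep]
  labels <- labels[keep]
}

sil <- silhouette(labels, dist(values))        # per-point silhouette widths
avg_width <- mean(sil[, "sil_width"])          # average s(i)
avg_width
```

`summary(sil)` additionally reports the average width per cluster, which can help spot a single poorly separated cluster.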
Upvotes: 0
Views: 305
Reputation: 77495
Silhouette assumes that all clusters have the same variance.
IMHO, it does not make sense to use this measure with EM clustering.
Upvotes: 1