add-semi-colons

Reputation: 18830

R implementation cluster analysis

I am implementing a few algorithms for cluster analysis, specifically for cluster validation. There are several approaches, such as cross-validation, external indices, internal indices, and relative indices. I am trying to implement an algorithm that falls under internal indices.

Internal index: based on the intrinsic content of the data. It measures the goodness of a clustering structure without reference to external information. The one I am interested in is the silhouette coefficient:

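s(i) = (b(i) - a(i)) / max{a(i), b(i)}

where a(i) is the mean distance from point i to the other points in its own cluster, and b(i) is the smallest mean distance from point i to the points of any other cluster.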

To make this clearer, let's assume I have the following multimodal distribution:

  library(mixtools)
  wait = faithful$waiting
  mixmdl = normalmixEM(wait)
  plot(mixmdl,which=2)
  lines(density(wait), lty=2, lwd=2)


We see that there are two clusters and that the cut-off is around 68. There is no labelled data here, so there is no ground truth for cross-validation (the problem is unsupervised). We therefore need a mechanism to evaluate the clusters. In this case we know from the visualization that there are two clusters, but how do we clearly show that the two distributions actually form separate clusters? Based on what I read on Wikipedia, the silhouette gives us that validation.

I want to implement a method (implementing the silhouette) that takes an R list of values (in my example, wait), the number of clusters (in this case 2), and the fitted model, and returns the average s(i).

I have started, but can't really figure out how to go forward:

Silhouette = function(rList, num_clusters, model) {

}
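For reference, here is a minimal base-R sketch of the average s(i) for one-dimensional data, assuming each point already has a hard cluster label. The function name `mean_silhouette` and its arguments `values` and `labels` are illustrative choices, not part of any package, and the full distance matrix it builds is only practical for a few thousand points.

```r
# Sketch: mean silhouette width for 1-D data with hard cluster labels.
# `values` is a numeric vector; `labels` is a vector of cluster ids of the
# same length. Both names are hypothetical, chosen for this example.
mean_silhouette <- function(values, labels) {
  stopifnot(length(values) == length(labels))
  labels <- as.integer(factor(labels))   # recode cluster ids as 1..k
  k <- max(labels)
  if (k < 2) stop("silhouette needs at least two clusters")

  d <- as.matrix(dist(values))           # n x n distances, O(n^2) memory
  n <- length(values)
  s <- numeric(n)

  for (i in seq_len(n)) {
    own   <- labels == labels[i]
    n_own <- sum(own)
    if (n_own == 1) next                 # singleton cluster: s(i) stays 0

    # d[i, i] is 0, so dividing by n_own - 1 excludes point i itself
    a <- sum(d[i, own]) / (n_own - 1)

    # smallest mean distance from point i to any other cluster
    b <- min(vapply(setdiff(seq_len(k), labels[i]),
                    function(cl) mean(d[i, labels == cl]),
                    numeric(1)))

    denom <- max(a, b)
    s[i]  <- if (denom == 0) 0 else (b - a) / denom
  }
  mean(s)
}

# Toy usage: two well separated groups
x   <- c(1, 2, 3, 10, 11, 12)
lab <- c(1, 1, 1, 2, 2, 2)
mean_silhouette(x, lab)
```

With a mixture model, hard labels could be taken as the most probable component for each point, for example `apply(mixmdl$posterior, 1, which.max)`, assuming the fitted object exposes a `posterior` matrix with one column per component.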

A summary of my list looks like this:

        Length Class  Mode
clust_A 416014 -none- numeric
clust_B  72737 -none- numeric
clust_C   6078 -none- numeric

myList$clust_A returns the points that belong to that cluster:

    [1]   13  880  497 1864  392   55 1130  248  437   37   62  153   60  117
   [15]   22  106   71 1026  446 1558   23   56  287  402   46 1506  115 2700
   [29]   67  134   48  536   41  506 1098   33   30  280  225   16   25   17
   [43]   63 1762  477  174   98   76  157  698   47  312   40    3  198  621
   [57]   15   34  226  657   48  110   23  250   14   32  137  272   26  257
   [71]  270  133 1734   78  134    8    5  225  187  166   35   15   94 2825
   [85]    2    8   94   89   54   91   77   17  106 1397   16   25   16  103

The problem is that I don't think the existing libraries accept this type of data structure.
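One possible way around the data-structure mismatch, sketched under assumptions: flatten the named list into a vector of values plus a parallel vector of integer cluster codes, which is the shape that `silhouette()` from the `cluster` package takes together with a `dist` object. The toy list below is a made-up stand-in for the real `myList`, and the subsample size of 2000 is an arbitrary choice; with hundreds of thousands of points a full distance matrix would not fit in memory, so only a random subsample is scored.

```r
library(cluster)  # provides silhouette(); a recommended package in standard R installs

# Toy stand-in for the real list (values invented for illustration)
myList <- list(clust_A = c(13, 55, 37, 62),
               clust_B = c(880, 1130, 1026),
               clust_C = c(2700, 2825))

# One flat vector of values and one integer cluster code per value
values <- unlist(myList, use.names = FALSE)
labels <- rep(seq_along(myList), lengths(myList))

# Score a random subsample to keep the distance matrix small
set.seed(1)
idx <- sample(length(values), min(length(values), 2000))

sil <- silhouette(labels[idx], dist(values[idx]))
mean(sil[, "sil_width"])   # average s(i) over the sampled points
```

Because the result depends on which points are sampled, repeating this with a few different seeds gives a rough idea of how stable the average is.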

Upvotes: 0

Views: 305

Answers (1)

Has QUIT--Anony-Mousse

Reputation: 77495

Silhouette assumes that all clusters have the same variance.

IMHO, it does not make sense to use this measure with EM clustering.

Upvotes: 1
