Jack Armstrong
Jack Armstrong

Reputation: 1249

Validating Fuzzy Clustering

I would like to use fuzzy C-means clustering on a large unsupervided data set of 41 variables and 415 observations. However, I am stuck on trying to validate those clusters. When I plot with a random number of clusters, I can explain a total of 54% of the variance, which is not great and there are no really nice clusters as their would be with the iris database for example.

First I ran the fcm with my scales data on 3 clusters just to see, but if I am trying to find way to search for the optimal number of clusters, then I do not want to set an arbitrary defined number of clusters.

So I turned to google and googled: "valdiate fuzzy clustering in R." This link here was good, but I still have to try a bunch of different numbers of clusters. I looked at the advclust, ppclust, and clvalid packages but I could not find a walkthrough for the functions. I looked at the documentation of each package, but also could not discern what to do next.

I walked through some possible number of clusters and checked each one with the k.crisp object from fanny. I started with 100 and got down to 4. Based on object description in the documentation,

k.crisp=integer ( ≤ k ) giving the number of crisp clusters; can be less than k , where it's recommended to decrease memb.exp.

it doesn't seem like a valid way because it is comparing the number of crisp clusters to our fuzzy clusters.

Is there a function where I can check the validity of my clusters from 2:10 clusters? Also, is it worth while to check the validity of 1 cluster? I think that is a stupid question, but I have a strange feeling 1 optimal cluster might be what I get. (Any tips on what to do if I were to get 1 cluster besides cry a little on the inside?)

Code

library(cluster)
library(factoextra)
library(ppclust)
library(advclust)
library(clValid)
data(iris)
df<-sapply(iris[-5],scale)
res.fanny<-fanny(df,3,metric='SqEuclidean')
res.fanny$k.crisp
# When I try to use euclidean, I get the warning all memberships are very close to 1/l. Maybe increase memb.exp, which I don't fully understand
# From my understanding using the SqEuclidean is equivalent to Fuzzy C-means, use the website below. Ultimately I do want to use C-means, hence I use the SqEuclidean distance
fviz_cluster(Res.fanny,ellipse.type='norm',palette='jco',ggtheme=theme_minimal(),legend='right')
fviz_silhouette(res.fanny,palette='jco',ggtheme=theme_minimal())

# With ppclust
set.seed(123)
res.fcm<-fcm(df,centers=3,nstart=10)

website as mentioned above.

Upvotes: 0

Views: 1187

Answers (1)

boyaronur
boyaronur

Reputation: 551

As far as I know, you need to go through different number of clusters and see how the percentage of variance explained is changing with different number of clusters. This method is called elbow method.

wss <- sapply(2:10, 
       function(k){fcm(df,centers=k,nstart=10)$sumsqrs$tot.within.ss})

plot(2:10, wss,
     type="b", pch = 19, frame = FALSE, 
     xlab="Number of clusters K",
     ylab="Total within-clusters sum of squares")

The resulting plot is

wss-number of clusters

After k = 5, total within cluster sum of squares tend to change slowly. So, k = 5 is a good candidate for being optimal number of clusters according to elbow method.

Upvotes: 1

Related Questions