joaoal

Reputation: 1982

How to create cluster plots for large datasets in R

I use the CLARA algorithm from Kaufman and Rousseeuw to cluster a large dataset with N > 8*10^6 in R. The implementation lets the user control execution time by, e.g., limiting the sample size (sampsize) to n = 100.

However, the plot() function in R seems to draw every data object, which results in very long processing times and very crowded plots (see the reproducible example below).

In theory it should be possible to plot only the best sample from CLARA instead of all N objects. Is there an implementation for this, or how can I work around this issue?

## generate 2.5 million objects, divided into 2 clusters
x <- rbind(cbind(rnorm(10^6,0,0.5), rnorm(10^6,0,0.5)),
           cbind(rnorm(1.5*10^6,5,0.5), rnorm(1.5*10^6,5,0.5)))

library("cluster")
# get cluster solution
clara.x <- clara(x, k = 2, sampsize = 100)
# see medoids
clara.x$medoids

# plot the cluster solution
plot(clara.x) # takes long time. creates crowded plot
clusplot(clara.x) # did not finish


Upvotes: 0

Views: 1425

Answers (2)

John Palowitch

Reputation: 307

First off, it seems that plot() for clara objects produces two plots, the first being identical to the one returned by clusplot(). If the former finished but the latter did not, I'm guessing that's just because you're clogging up the plot history. If you save large plots to a png file, you won't run into this problem. They'll still take a while, but they won't interfere with whatever else you're doing.
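A minimal sketch of the save-to-file approach. It assumes a much smaller simulated dataset than in the question so that it runs quickly; the file name and image size are arbitrary choices, not something from the question.

```r
library(cluster)

set.seed(1)
# Smaller stand-in for the question's data: 2 clusters, 5000 points in total
x <- rbind(cbind(rnorm(2000, 0, 0.5), rnorm(2000, 0, 0.5)),
           cbind(rnorm(3000, 5, 0.5), rnorm(3000, 5, 0.5)))
clara.x <- clara(x, k = 2, sampsize = 100)

# Hypothetical output path; replace with wherever you want the image
out_file <- file.path(tempdir(), "clara_clusplot.png")

png(out_file, width = 800, height = 800)  # open a file device instead of the screen
clusplot(clara.x)                         # draw into the file
dev.off()                                 # close the device so the file is written
```

The plot never reaches the interactive graphics window, so it does not accumulate in the plot history.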

Regarding reducing the number of plotted points, we can do this manually by adjusting the list elements of clara.x. You just have to choose which points you want to plot. Below, I give an example that uses only the best sample from the clara method. If you want to plot more points, you can draw additional indices with sample():

# Manually shrinking clara object
samp <- clara.x$sample
clara.x$data <- clara.x$data[samp, ]
clara.x$clustering <- clara.x$clustering[samp]
clara.x$i.med <- match(clara.x$i.med, samp) # re-index medoids relative to samp

# plot the cluster solution
clusplot(clara.x)

One subtlety is that the medoid indices must always be among the indices you choose to plot; otherwise the match() call above returns NA for the missing medoids. To ensure this for any given samp, add the following right after samp is defined:

samp <- union(samp, clara.x$i.med)
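As a sketch of the sample() variant: instead of editing the clara object, you can also pass a subset of the data and the matching cluster labels straight to clusplot(), which accepts a data matrix plus a clustering vector. The subset size n_plot and the smaller simulated dataset are assumptions for illustration only.

```r
library(cluster)

set.seed(42)
# Smaller stand-in for the question's data
x <- rbind(cbind(rnorm(2e4, 0, 0.5), rnorm(2e4, 0, 0.5)),
           cbind(rnorm(3e4, 5, 0.5), rnorm(3e4, 5, 0.5)))
clara.x <- clara(x, k = 2, sampsize = 100)

# Hypothetical number of points to draw; tune to taste
n_plot <- 2000

# Random subset of row indices, always including the medoids
idx <- union(sample(nrow(x), n_plot), clara.x$i.med)

# Plot only the subset, reusing the cluster labels from the original fit
clusplot(x[idx, ], clara.x$clustering[idx],
         main = "clusplot of a random subset")
```

The clustering itself is not recomputed here; only the set of plotted points changes.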

ADDENDUM: Just saw the other answer, which differs from mine. It suggests re-computing the clustering on the sample. A benefit of my approach is that it keeps the original clustering computation and only adjusts which points you plot.

Upvotes: 1

Weihuang Wong

Reputation: 13108

I'm not familiar with the CLARA method, so this answer responds directly to your question about how to "plot the best sample from CLARA."

A quick review of ?clara.object shows that the case numbers of the sample used in the final partition are found in the sample component, hence you can recover the observations with

best_samp <- x[clara.x$sample, ]

Plotting the sample:

par(mfrow = c(1, 2))
plot(best_samp, main = "scatterplot")
clusplot(clara(best_samp, k = 2, sampsize = nrow(best_samp)),
  main = "clusplot")
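If you would rather keep the cluster labels from the original fit than re-cluster the sample, a sketch along these lines might work. It assumes the data have no row names, so that clara.x$sample holds case numbers that can index both x and clara.x$clustering; the simulated data are smaller than in the question.

```r
library(cluster)

set.seed(7)
# Smaller stand-in for the question's data
x <- rbind(cbind(rnorm(2e4, 0, 0.5), rnorm(2e4, 0, 0.5)),
           cbind(rnorm(3e4, 5, 0.5), rnorm(3e4, 5, 0.5)))
clara.x <- clara(x, k = 2, sampsize = 100)

samp_idx  <- clara.x$sample               # case numbers of the best sample
best_samp <- x[samp_idx, ]                # observations in that sample
samp_clus <- clara.x$clustering[samp_idx] # their labels from the original fit

# Scatterplot of the sample, coloured by original cluster, medoids marked with crosses
plot(best_samp, col = samp_clus, pch = 19,
     xlab = "x1", ylab = "x2",
     main = "Best CLARA sample, original clusters")
points(clara.x$medoids, pch = 4, cex = 2, lwd = 2)
```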


Upvotes: 1
