Sam S.

Reputation: 803

Silhouette calculation in R for large data

I want to calculate the silhouette for cluster evaluation. There are some packages in R for this, for example cluster and clValid. Here is my code using the cluster package:

# load the data
# a data from the UCI website with 434874 obs. and  3 variables
data <- read.csv("./data/spatial_network.txt",sep="\t",header =  F)

# apply kmeans
km_res <- kmeans(data,20,iter.max = 1000,
               nstart=20,algorithm="MacQueen")

# calculate silhouette
library(cluster)   
sil <- silhouette(km_res$cluster, dist(data))

# plot silhouette
library(factoextra)
fviz_silhouette(sil)

The code works well for smaller data, say 50,000 observations, but for the full data I get an error like "Error: cannot allocate vector of size 704.5 Gb". This is likely a problem for the Dunn index and other internal indices on large datasets as well.

I have 32 GB of RAM in my computer. The problem comes from calculating dist(data): with 434,874 observations, the distance matrix has n(n-1)/2 ≈ 9.5 × 10^10 entries, which at 8 bytes each is roughly the 704.5 GB in the error message. I am wondering whether it is possible to avoid calculating dist(data) in advance and instead compute each distance only when it is required in the silhouette formula.

I would appreciate any help with this problem and with how to calculate the silhouette for large and very large datasets.

Upvotes: 3

Views: 9229

Answers (3)

Rafael Díaz

Reputation: 2289

It is possible to calculate the silhouette index without building the distance matrix yourself: you can use the clues package instead, which improves both the runtime and the memory usage compared with the cluster package. Here is an example:

library(rbenchmark)
library(cluster)
library(clues)

set.seed(123)
x = c(rnorm(1000,0,0.9), rnorm(1000,4,1), rnorm(1000,-5,1))
y = c(rnorm(1000,0,0.9), rnorm(1000,6,1), rnorm(1000, 5,1))
cluster = rep(as.factor(1:3),each = 1000)

df <- cbind(x,y)
head(df)
               x           y
[1,] -0.50442808 -0.13527673
[2,] -0.20715974 -0.29498142
[3,]  1.40283748 -1.30334876
[4,]  0.06345755 -0.62755613
[5,]  0.11635896  2.33864121
[6,]  1.54355849 -0.03367351

Runtime comparison between the two functions

 benchmark(f1 = silhouette(as.integer(cluster), dist = dist(df)),
           f2 = get_Silhouette(y = df, mem = cluster))
  test replications elapsed relative user.self sys.self user.child sys.child
1   f1          100   15.16    1.902     13.00     1.64         NA        NA
2   f2          100    7.97    1.000      7.76     0.00         NA        NA

Comparison in memory usage between the two functions

library(pryr)
object_size(silhouette(as.integer(cluster), dist = dist(df)))
73.9 kB
object_size(get_Silhouette(y = df, mem = cluster))
36.6 kB

In conclusion, clues::get_Silhouette reduces both the runtime and the memory used while producing the same result.

Upvotes: 3

Sam S.

Reputation: 803

Anony-Mousse's answer is perfect, particularly the subsampling suggestion. This is very important for very large datasets because of the increase in computational cost.

Here is another solution for calculating internal measures such as the silhouette and the Dunn index, using the R package clusterCrit. clusterCrit computes clustering validation indices and does not require the entire distance matrix in advance. However, it might be slow, as Anony-Mousse discussed. See the documentation for clusterCrit::intCriteria: https://www.rdocumentation.org/packages/clusterCrit/versions/1.2.8/topics/intCriteria

clusterCrit also covers most of the common internal measures for cluster validation.

Example:

library(clusterCrit)
intCriteria(as.matrix(data), km_res$cluster, c("Silhouette","Calinski_Harabasz","Dunn"))
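Since the silhouette part still needs all pairwise distances, it can be slow on the full 434,874 rows. Here is a rough sketch (mine, not from the original answer) of combining intCriteria with the subsampling idea; the sample size of 20,000 is an arbitrary choice:

# evaluate the indices on a random subsample to keep the silhouette tractable
set.seed(1)
idx <- sample(nrow(data), 20000)
intCriteria(as.matrix(data[idx, ]), km_res$cluster[idx],
            c("Silhouette", "Calinski_Harabasz", "Dunn"))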

Upvotes: 3

Has QUIT--Anony-Mousse

Reputation: 77454

You can implement Silhouette yourself.

It only needs every distance twice, so storing an entire distance matrix is not necessary. It may run a bit slower because it computes distances twice, but at the same time the better memory efficiency may well make up for that.

It will still take a LONG time though.
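As an illustration only (this sketch is mine, not part of the answer): a straightforward R implementation that recomputes one row of Euclidean distances at a time, so memory stays O(n) while time remains O(n²). X is assumed to be a numeric matrix and cl an integer vector of cluster labels.

silhouette_lowmem <- function(X, cl) {
  n <- nrow(X)
  ids <- sort(unique(cl))
  sil <- numeric(n)
  for (i in seq_len(n)) {
    # Euclidean distances from point i to all points, computed on the fly
    d <- sqrt(rowSums(sweep(X, 2, X[i, ])^2))
    m <- tapply(d, cl, mean)           # mean distance to each cluster
    own <- as.character(cl[i])
    n_own <- sum(cl == cl[i])
    if (n_own == 1) { sil[i] <- 0; next }
    a <- m[own] * n_own / (n_own - 1)  # correct for the zero self-distance
    b <- min(m[setdiff(as.character(ids), own)])
    sil[i] <- (b - a) / max(a, b)
  }
  sil
}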

You should consider using only a subsample (do you really need to consider all points?), as well as alternatives such as the simplified silhouette, in particular with k-means... You gain very little from extra data with such methods, so you may just use a subsample; both ideas are sketched below.
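For illustration, a hedged sketch of both ideas (my own code, not from the answer): the simplified silhouette replaces pairwise distances with distances to the k-means centroids, and subsampling simply feeds a random subset to cluster::silhouette. The subsample size of 10,000 is an arbitrary choice.

# Simplified silhouette from distances to the k-means centroids only.
# Assumes X is a numeric matrix and km the result of kmeans(X, k).
simplified_silhouette <- function(X, km) {
  # n x k matrix of distances from every point to every centroid
  d <- apply(km$centers, 1, function(ctr) sqrt(rowSums(sweep(X, 2, ctr)^2)))
  a <- d[cbind(seq_len(nrow(X)), km$cluster)]  # distance to own centroid
  d[cbind(seq_len(nrow(X)), km$cluster)] <- Inf
  b <- apply(d, 1, min)                        # distance to nearest other centroid
  (b - a) / pmax(a, b)
}

# Exact silhouette on a random subsample instead of the full data
idx <- sample(nrow(data), 10000)
sil_sub <- cluster::silhouette(km_res$cluster[idx], dist(data[idx, ]))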

Upvotes: 7
