Dominique Makowski

Reputation: 1673

Compute between clusters sum of squares (BCSS) and total sum of squares manually (clustering in R)

I am trying to manually retrieve some of the statistics associated with a clustering solution, based only on the data and the cluster assignments.

For instance, kmeans() computes the between-cluster and total sums of squares.

data <- iris[1:4]
  
fit <- kmeans(data, 3)
clusters <- fit$cluster

fit$betweenss
#> [1] 602.5192
fit$totss
#> [1] 681.3706

Created on 2021-08-09 by the reprex package (v2.0.1)

I would like to recover these indices without calling kmeans(), using only data and the vector of clusters (so that I could apply this to any clustering solution).

Thanks to this other post, I managed to retrieve the within-cluster sum of squares, and now I only lack the between-cluster and total sums of squares. For them, that other post says:

The total sum of squares, sum_x sum_y ||x-y||² is constant.

The total sum of squares can be computed trivially from variance.

If you now subtract the within-cluster sum of squares where x and y belong to the same cluster, then the between cluster sum of squares remains.

But I don't know how to translate that to R... Any help is appreciated.

Upvotes: 1

Views: 2175

Answers (1)

dcarlson

Reputation: 11056

This will compute the Total Sum of Squares (TSS), the Within Sum of Squares (WSS), and the Between Sum of Squares (BSS). You really only need the first two since BSS = TSS - WSS:

data <- iris[1:4]    # Same data as in the question.
set.seed(42)    # Set seed since kmeans uses a random start.
fit <- kmeans(data, 3)
clusters <- fit$cluster

# Subtract the grand mean of each column from its values, and count the observations in each cluster.
data.cent <- scale(data, scale=FALSE)
nrows <- table(clusters)

(TSS <- sum(data.cent^2))
# [1] 681.3706
(WSS <- sapply(split(data, clusters), function(x) sum(scale(x, scale=FALSE)^2)))
#        1        2        3 
# 15.15100 39.82097 23.87947 
(BSS <- TSS - sum(WSS))
# [1] 602.5192
# Compute BSS directly
gmeans <- sapply(split(data, clusters), colMeans)
means <- colMeans(data)
(BSS <- sum(colSums((gmeans - means)^2) * nrows))
# [1] 602.5192
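
Since the goal is to apply this to any clustering solution, the steps above could be wrapped in a small function. This is only a minimal sketch: the name cluster_ss and the names of the returned list elements are hypothetical (chosen to mirror the kmeans() output), and it assumes that data is entirely numeric and that clusters has one label per row.

```r
# Hypothetical helper (not part of any package): sum-of-squares statistics
# computed only from the data and a vector of cluster labels.
cluster_ss <- function(data, clusters) {
  x <- as.matrix(data)
  stopifnot(is.numeric(x), length(clusters) == nrow(x))

  # Total SS: squared deviations from the grand (column) means.
  totss <- sum(scale(x, scale = FALSE)^2)

  # Within SS: squared deviations from each cluster's own column means.
  groups <- split(seq_len(nrow(x)), clusters)
  withinss <- vapply(
    groups,
    function(idx) sum(scale(x[idx, , drop = FALSE], scale = FALSE)^2),
    numeric(1)
  )

  list(
    totss = totss,
    withinss = withinss,
    tot.withinss = sum(withinss),
    betweenss = totss - sum(withinss)  # BSS = TSS - WSS
  )
}

# Usage: compare with what kmeans() reports for the same assignment.
set.seed(42)
fit <- kmeans(iris[1:4], 3)
ss <- cluster_ss(iris[1:4], fit$cluster)
all.equal(ss$totss, fit$totss)
all.equal(ss$betweenss, fit$betweenss)

# The total SS can also be recovered from the column variances,
# as the quoted post suggests: (n - 1) * sum of variances.
all.equal(ss$totss, (nrow(iris) - 1) * sum(apply(iris[1:4], 2, var)))
```

Because the function never looks at how the labels were produced, the same call should work with labels from other methods, for example cutree() applied to an hclust() result.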

Upvotes: 2
