Reputation: 1673
I am trying to manually retrieve some of the statistics associated with clustering solutions, based only on the data and the cluster assignments.
For instance, kmeans() computes the between-cluster and total sums of squares.
data <- iris[1:4]
fit <- kmeans(data, 3)
clusters <- fit$cluster
fit$betweenss
#> [1] 602.5192
fit$totss
#> [1] 681.3706
Created on 2021-08-09 by the reprex package (v2.0.1)
I would like to recover these indices without the call to kmeans, using only data and the vector of clusters (so that I could apply this to any clustering solution).
Thanks to this other post, I managed to retrieve the within-cluster sum of squares, so now I only lack the between-cluster and total ones. For those, the other post says:
The total sum of squares, sum_x sum_y ||x-y||² is constant.
The total sum of squares can be computed trivially from variance.
If you now subtract the within-cluster sum of squares where x and y belong to the same cluster, then the between cluster sum of squares remains.
But I don't know how to translate that into R. Any help is appreciated.
Upvotes: 1
Views: 2175
Reputation: 11056
This will compute the Total Sum of Squares (TSS), the Within Sum of Squares (WSS), and the Between Sum of Squares (BSS). You really only need the first two since BSS = TSS - WSS:
data <- iris[1:4] # Same data as in the question.
set.seed(42) # Set seed since kmeans uses a random start.
fit <- kmeans(data, 3)
clusters <- fit$cluster
# Center each column on its grand mean and count the observations in each cluster.
data.cent <- scale(data, scale=FALSE)
nrows <- table(clusters)
(TSS <- sum(data.cent^2))
# [1] 681.3706
(WSS <- sapply(split(data, clusters), function(x) sum(scale(x, scale=FALSE)^2)))
# 1 2 3
# 15.15100 39.82097 23.87947
(BSS <- TSS - sum(WSS))
# [1] 602.5192
# Compute BSS directly
gmeans <- sapply(split(data, clusters), colMeans)
means <- colMeans(data)
(BSS <- sum(colSums((gmeans - means)^2) * nrows))
# [1] 602.5192
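Since the goal is to apply this to any clustering solution, the steps above could be wrapped in a small function. This is only a sketch: cluster_ss is a hypothetical name (it is not part of base R or any package), and it assumes data is all numeric and clusters has one label per row.

```r
# Hypothetical helper: sums of squares from the data and cluster labels only.
cluster_ss <- function(data, clusters) {
  x <- as.matrix(data)
  stopifnot(is.numeric(x), length(clusters) == nrow(x))
  clusters <- droplevels(as.factor(clusters))

  # Total SS: squared deviations from the grand (column) means.
  totss <- sum(scale(x, scale = FALSE)^2)

  # Within SS per cluster: squared deviations from each cluster's own means.
  withinss <- sapply(split(as.data.frame(x), clusters),
                     function(g) sum(scale(g, scale = FALSE)^2))

  list(totss = totss,
       withinss = withinss,
       tot.withinss = sum(withinss),
       betweenss = totss - sum(withinss))
}

# Usage: compare against what kmeans() reports for the same assignment.
fit <- kmeans(iris[1:4], 3)
ss <- cluster_ss(iris[1:4], fit$cluster)
all.equal(ss$totss, fit$totss)
all.equal(ss$betweenss, fit$betweenss)
```

The element names mirror those of a kmeans object so the results are easy to compare, but the function never calls kmeans itself, so the labels could just as well come from something like cutree(hclust(dist(data)), 3).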
Upvotes: 2