Jayzpeer

Reputation: 81

Get cluster mean in k-means clustering analysis with R

I created two clusters using the k-means algorithm on data with 4 variables. If I want the mean of each variable within each cluster, should I use:

clusteredsubset$centers

or

colMeans(y[clusteredsubset$cluster == 1,])
colMeans(y[clusteredsubset$cluster == 2,])

where y is the data matrix (4 columns) and clusteredsubset is the result of kmeans.

Upvotes: 1

Views: 3184

Answers (2)

Zheyuan Li

Reputation: 73385

Either one is fine, as they give the same result. But since kmeans already returns the centers, why not use them?

The following is based on the first example from ?kmeans:

set.seed(0)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
cl <- kmeans(x, 2)

## what `kmeans` returns
cl$centers
#              x            y
#1 -0.0008158201 -0.008394296
#2  0.9261878482  1.029984748

## manual computation
colMeans(x[cl$cluster == 1, ])
#            x             y 
#-0.0008158201 -0.0083942957 

colMeans(x[cl$cluster == 2, ])
#        x         y 
#0.9261878 1.0299847 

The results are exactly the same (the difference in number of digits is just a printing effect).
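To confirm the match numerically rather than by reading the printed digits, here is a minimal sketch. It rebuilds the same x and cl as above; the names manual and same are introduced here for illustration only, and the comparison assumes kmeans has converged on this data.

```r
set.seed(0)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
cl <- kmeans(x, 2)

## per-cluster column means computed by hand, one row per cluster
## (`manual` is an illustrative name, not part of the original answer)
manual <- rbind(colMeans(x[cl$cluster == 1, ]),
                colMeans(x[cl$cluster == 2, ]))

## compare values only (dimnames dropped), up to all.equal's default tolerance
same <- isTRUE(all.equal(unname(cl$centers), unname(manual)))
same
```

all.equal compares within a small numeric tolerance, so it is not affected by how many digits print chooses to show.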

## make a plot
plot(x, col = cl$cluster)
points(cl$centers, col = 1:2, pch = 8, cex = 2)


Upvotes: 3

SolomonRoberts

Reputation: 114

I would use:

## split the data y (not the kmeans object) by cluster label,
## then take the mean of every column within each piece
means <- sapply(split(as.data.frame(y), clusteredsubset$cluster), colMeans)
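A minimal sketch of this split-and-average idea on made-up data, reusing the question's names y and clusteredsubset. The data, the seed, and the column names v1 to v4 are assumptions for illustration. Note that it is the data y that gets split by cluster label, not the kmeans result itself.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

## hypothetical data: 100 rows, 4 variables, two well-separated groups
y <- rbind(matrix(rnorm(200, mean = 0), ncol = 4),
           matrix(rnorm(200, mean = 5), ncol = 4))
colnames(y) <- paste0("v", 1:4)
clusteredsubset <- kmeans(y, centers = 2)

## split the rows of y by cluster label, then take column means per piece
means <- sapply(split(as.data.frame(y), clusteredsubset$cluster), colMeans)

means     # variables in rows, clusters in columns
t(means)  # transposed: same layout as clusteredsubset$centers
```

The result comes out transposed relative to clusteredsubset$centers, so t(means) is the form to compare against it.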

Upvotes: 1
