Francois

Reputation: 901

Why doesn't kmeans find the 3 clusters?

I ran k-means on a 3-dimensional dataset and got the following result:

[image: 3D plot of the data set, coloured by cluster]

The code is as follows:

library(tidyr)

setwd('C:/temp/rwd')
getwd()

df <- read.table('data-1581352459203.csv', 
                 header = TRUE,
                 sep = ",")

dff <- df %>% pivot_wider(names_from = SensorId, values_from = last)

data = data.frame(dff$`3`, dff$`4`, dff$`5`)
cf.kmeans <- kmeans(data, centers = 3, nstart = 20)
cf.kmeans

library(plot3D)
x <- dff$`3`
y <- dff$`4`
z <- dff$`5`
scatter3D(x, y, z, 
          bty ="g", pch = cf.kmeans$cluster, colvar=as.numeric(cf.kmeans$cluster),
          xlab = "Temperature", ylab = "Humidity", zlab = "Speed",
          ticktype = "detailed")

library("plot3Drgl")
plotrgl()

The dataset looks like this (90 observations):

[image: preview of the dataset]

I would very much appreciate an explanation of why kmeans does not find the obvious clusters.

Upvotes: 2

Views: 94

Answers (1)

StupidWolf

Reputation: 46978

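Your variables are on different scales. kmeans clusters on Euclidean distance, so you need to scale the data first; otherwise the variables with the larger ranges dominate the distance and the smaller-range variable is effectively ignored. See below for a reproducible example: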

library(plot3D)
set.seed(100)
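# Three groups of 20 points: columns 1 and 2 have means of 0 or 30 (sd 5),
# while column 3 has means of 0 or 1 (sd 0.1), yet it alone separates groups 2 and 3
mat = cbind(rnorm(60,rep(c(0,30,30),each=20),5),
            rnorm(60,rep(c(0,30,30),each=20),5),
            rnorm(60,rep(c(0,0,1),each=20),0.1)
)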
clus = kmeans(mat,3,nstart = 20)

scatter3D(mat[,1],mat[,2],mat[,3],
          ticktype = "detailed",colvar=clus$cluster)

[image: 3D scatter plot coloured by the clusters found on the unscaled data]

The result above is similar to yours. Now scale the data before clustering:

clus=kmeans(scale(mat),3,nstart=20)
scatter3D(mat[,1],mat[,2],mat[,3],ticktype = "detailed",colvar=clus$cluster)

[image: 3D scatter plot coloured by the clusters found on the scaled data]
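If you want to check this without relying on the plots, a minimal sketch along the following lines may help. It rebuilds the same simulated matrix, prints the per-column standard deviations, and cross-tabulates each clustering against the known group labels. The names `truth`, `raw` and `scaled` are introduced here for illustration only and are not part of the code above.

```r
set.seed(100)

# Known group membership of the simulated points (hypothetical helper vector)
truth <- rep(1:3, each = 20)

# Same construction as the example above
mat <- cbind(rnorm(60, rep(c(0, 30, 30), each = 20), 5),
             rnorm(60, rep(c(0, 30, 30), each = 20), 5),
             rnorm(60, rep(c(0, 0, 1), each = 20), 0.1))

# Spread of each column: a column with a much smaller spread
# contributes very little to the Euclidean distance
col_sd <- apply(mat, 2, sd)
print(col_sd)

# Cluster on the raw values and on the standardised values
raw    <- kmeans(mat, centers = 3, nstart = 20)
scaled <- kmeans(scale(mat), centers = 3, nstart = 20)

# Rows are the true groups, columns are the cluster labels
# (the label numbers themselves are arbitrary)
print(table(truth, raw$cluster))
print(table(truth, scaled$cluster))
```

In each table, a true group that ends up spread over several columns has been split or merged by kmeans. For the data in the question, the equivalent change would presumably be `kmeans(scale(data), centers = 3, nstart = 20)`, assuming `data` holds only the three numeric sensor columns.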

Upvotes: 2
