Problems with K-means clustering in R

Question

When I try to do K-means clustering on the standard iris data:

library('tidyverse')
iris_Cluster <- kmeans(iris[, 3:4], 2, nstart = 10)
iris$cluster <- as.factor(iris_Cluster$cluster)
p_iris <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color=cluster)) + geom_point()
print(p_iris)

One point ends up in the wrong cluster. What is the problem? Is this a weakness of the K-means clustering algorithm? How can I get an appropriate result? What are good algorithms for partitional clustering?

G5W · Accepted Answer

The point that belongs to the "wrong" cluster is point 99. It has Petal.Length = 3 and Petal.Width = 1.1. You can get the centers of your clusters from

iris_Cluster$centers
  Petal.Length Petal.Width
1     4.925253   1.6818182
2     1.492157   0.2627451

You can see the distances from point 99 to the two cluster centers using

as.matrix(dist(rbind(iris_Cluster$centers, iris[99,3:4])))
          1        2       99
1  0.000000 3.714824 2.011246
2  3.714824 0.000000 1.724699
99 2.011246 1.724699 0.000000
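
The same check can be run for all 150 points at once. The following is only a minimal sketch in base R: the names petals, km, dist_to_centers and nearest are made up for illustration, and set.seed(1) is an assumption added so the run is repeatable. It recomputes the distance from every point to each center and compares the nearest center with the cluster that kmeans assigned.

```r
# Re-run the k-means fit from the question on the two petal columns.
set.seed(1)  # assumption: fixed seed so the run is repeatable
petals <- as.matrix(iris[, 3:4])
km <- kmeans(petals, centers = 2, nstart = 10)

# Euclidean distance from every point to each of the two centers
# (one row per point, one column per center).
dist_to_centers <- sapply(1:2, function(k) {
  sqrt(rowSums(sweep(petals, 2, km$centers[k, ])^2))
})

# Index of the nearest center for each point.
nearest <- apply(dist_to_centers, 1, which.min)

# TRUE when every point sits in the cluster of its nearest center.
all(nearest == km$cluster)

# Distances from point 99 to the two centers.
dist_to_centers[99, ]
```

If the fit matches the one shown above, the last line should give roughly 2.01 and 1.72, in whichever order the cluster labels happen to come out.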

Point 99 is closer to the cluster center at (1.49, 0.26) than to the one at (4.93, 1.68), so k-means puts it in that cluster. This is how k-means works: each point is assigned to its nearest cluster center, regardless of which group its neighboring points fall into.

As suggested by @Anony-Mousse, DBSCAN may be more to your liking. The "DB" part stands for density-based: it builds clusters whose points can be connected through regions of high density. Another option is single-linkage hierarchical clustering, which tends to put points that are near each other in the same cluster.

Mimicking your code but using hclust:

library(ggplot2)
iris_HC <- hclust(dist(iris[,3:4]), method="single")
iris_Cluster <- cutree(iris_HC, 2)
iris$cluster <- as.factor(iris_Cluster)

p_iris <- ggplot(iris, aes(x=Petal.Length, y=Petal.Width, color=cluster)) + geom_point()
print(p_iris)
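
To see where point 99 lands without reading it off the plot, the cluster labels can be tabulated against the species. This is a rough sketch rather than a definitive recipe: the names petals, hc, hc_cluster and db are illustrative. The DBSCAN part runs only if the dbscan package is installed, and its eps and minPts values are guesses that would need tuning for other data.

```r
petals <- iris[, 3:4]

# Single-linkage hierarchical clustering, cut into two groups as above.
hc <- hclust(dist(petals), method = "single")
hc_cluster <- cutree(hc, k = 2)

# Cluster sizes by species, then the clusters of point 1 (a setosa) and point 99.
table(cluster = hc_cluster, species = iris$Species)
hc_cluster[c(1, 99)]

# Optional: DBSCAN, only if the 'dbscan' package is available.
# eps and minPts are illustrative assumptions, not tuned values.
if (requireNamespace("dbscan", quietly = TRUE)) {
  db <- dbscan::dbscan(as.matrix(petals), eps = 0.5, minPts = 5)
  # Cluster 0 is the label DBSCAN gives to noise points.
  print(table(cluster = db$cluster, species = iris$Species))
}
```

Comparing the two labels printed for points 1 and 99 shows directly whether point 99 has been separated from the setosa group.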
