drastega

Reputation: 1773

Problems with K-means clustering in R

When I run k-means clustering on the standard iris data:

library('tidyverse')
iris_Cluster <- kmeans(iris[, 3:4], 2, nstart = 10)
iris$cluster <- as.factor(iris_Cluster$cluster)
p_iris <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color=cluster)) + geom_point()
print(p_iris)


One point ends up in the wrong cluster. What is the problem? Is this a weakness of the k-means algorithm? How can I get an appropriate result, and what are good algorithms for partitional clustering?

Upvotes: 0

Views: 738

Answers (2)

G5W

Reputation: 37661

The point that belongs to the "wrong" cluster is point 99. It has Petal.Length = 3 and Petal.Width = 1.1. You can get the centers of your clusters from

iris_Cluster$centers
  Petal.Length Petal.Width
1     4.925253   1.6818182
2     1.492157   0.2627451

You can see the distance from point 99 to the cluster centers using

as.matrix(dist(rbind(iris_Cluster$centers, iris[99,3:4])))
          1        2       99
1  0.000000 3.714824 2.011246
2  3.714824 0.000000 1.724699
99 2.011246 1.724699 0.000000

Point 99 is closer to the cluster center at (1.49, 0.26): its distance to that center is about 1.72, versus about 2.01 to the other one. That is the whole story. K-means assigns each point to its nearest cluster center; it does not look at which group the neighboring points belong to, so a point can land in a cluster that looks wrong on the plot.

As suggested by @Anony-Mousse, DBSCAN may be more to your liking. The DB part stands for Density Based, and it creates clusters in which the points can be connected through regions of high density. Another option is single-link hierarchical clustering, which tends to put points that are near each other in the same cluster.
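To see that this rule holds for every point and not just point 99, here is a minimal sketch that recomputes the nearest center by hand and compares it with the labels that kmeans returned. The names km, d2 and nearest are made up for this illustration, and the set.seed call is an assumption added only so that repeated runs behave the same way.

```r
# Re-run k-means on the two petal columns of the built-in iris data.
set.seed(1)  # assumption: seed added only for reproducibility
km <- kmeans(iris[, 3:4], centers = 2, nstart = 10)

# Squared Euclidean distance from every point to every center
# (one column per center).
x <- as.matrix(iris[, 3:4])
d2 <- sapply(seq_len(nrow(km$centers)), function(k) {
  rowSums(sweep(x, 2, km$centers[k, ])^2)
})

# Index of the nearest center for each point.
nearest <- max.col(-d2, ties.method = "first")

# Do the hand-computed labels agree with the k-means labels?
all(nearest == km$cluster)

# Distances from point 99 to the two centers.
sqrt(d2[99, ])
```

If the comparison comes back TRUE, the "wrong" label is exactly what the nearest-center rule prescribes, so k-means is behaving as designed.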

Mimicking your code but using hclust:

library(ggplot2)
iris_HC <- hclust(dist(iris[,3:4]), method="single")
iris_Cluster <- cutree(iris_HC, 2)
iris$cluster <- as.factor(iris_Cluster)

p_iris <- ggplot(iris, aes(x=Petal.Length, y=Petal.Width, color=cluster)) + geom_point()
print(p_iris)
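If you want to compare the two partitions without reading them off a plot, a cross-tabulation is one option. This is only a sketch: the names x, km and hc are chosen for this example, the seed is an assumption added for reproducibility, and the cluster numbers of the two methods are arbitrary labels that need not match up.

```r
# Both methods on the same two petal columns.
x <- iris[, 3:4]

set.seed(1)  # assumption: seed added only for reproducibility
km <- kmeans(x, centers = 2, nstart = 10)
hc <- cutree(hclust(dist(x), method = "single"), k = 2)

# Rows are k-means clusters, columns are single-link clusters.
# Label numbers are arbitrary, so look at how the counts split.
tab <- table(kmeans = km$cluster, single_link = hc)
tab

# Labels given to point 99 by each method.
c(kmeans = unname(km$cluster[99]), single_link = unname(hc[99]))
```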


Upvotes: 1

Has QUIT--Anony-Mousse

Reputation: 77495

Yes, by the sum-of-squares objective, this point belongs to the red cluster.

Consider, e.g., DBSCAN.
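A minimal sketch of what that could look like, assuming the CRAN package dbscan is installed. The eps and minPts values below are illustrative guesses rather than tuned settings, so the result may change noticeably with other choices.

```r
# Assumption: the CRAN package 'dbscan' is installed
# (install.packages("dbscan")).
library(dbscan)

x <- as.matrix(iris[, 3:4])

# eps (neighborhood radius) and minPts (density threshold) are
# illustrative guesses, not tuned values.
db <- dbscan(x, eps = 0.5, minPts = 5)

# Cluster sizes; cluster 0 holds the points treated as noise.
table(db$cluster)

# Cluster assigned to point 99.
db$cluster[99]
```

Unlike k-means, you do not fix the number of clusters here; it follows from eps and minPts, so it is worth trying a few values.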

Upvotes: 1
