Reputation: 1773
When I try to do k-means clustering on the standard iris data:
library(tidyverse)

# k-means with k = 2 on petal length and width (columns 3:4)
iris_Cluster <- kmeans(iris[, 3:4], 2, nstart = 10)
iris$cluster <- as.factor(iris_Cluster$cluster)
p_iris <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = cluster)) + geom_point()
print(p_iris)
one point is assigned to the wrong cluster. What is the problem? Is this a weakness of the k-means clustering algorithm? How can I get an appropriate result? What are good algorithms for partitional clustering?
Upvotes: 0
Views: 738
Reputation: 37661
The point that belongs to the "wrong" cluster is point 99. It has Petal.Length = 3 and Petal.Width = 1.1. You can get the centers of your clusters from
iris_Cluster$centers
  Petal.Length Petal.Width
1     4.925253   1.6818182
2     1.492157   0.2627451
You can see the distances from point 99 to the cluster centers using
as.matrix(dist(rbind(iris_Cluster$centers, iris[99,3:4])))
          1        2       99
1  0.000000 3.714824 2.011246
2  3.714824 0.000000 1.724699
99 2.011246 1.724699 0.000000
Point 99 is closer to the cluster center at (1.49, 0.26), so that is where k-means puts it. The problem is that k-means assigns each point to the nearest cluster center, not to the center that makes sense given which cluster the nearby points fall in. As suggested by @Anony-Mousse, DBSCAN may be more to your liking. The DB part stands for Density-Based: it builds clusters in which the points are connected through regions of high density. Another option is single-link hierarchical clustering, which tends to put points that are near each other in the same cluster.
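You can confirm the assignment directly. The cluster numbering is arbitrary and may differ between runs, but with the centers shown above this returns 2:

# The label k-means assigned to point 99: the center at (1.49, 0.26)
iris_Cluster$cluster[99]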
Mimicking your code but using hclust:
library(ggplot2)

# Single-link (nearest-neighbour) agglomerative clustering on the
# same two features, then cut the dendrogram into 2 clusters
iris_HC <- hclust(dist(iris[, 3:4]), method = "single")
iris_Cluster <- cutree(iris_HC, 2)
iris$cluster <- as.factor(iris_Cluster)
p_iris <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = cluster)) + geom_point()
print(p_iris)
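For comparison, here is a minimal DBSCAN sketch along the same lines. It assumes the dbscan package is installed, and the eps and minPts values are illustrative choices that would need tuning for your data:

library(dbscan)

# Density-based clustering on the same two features; points in
# low-density regions get label 0 (noise) instead of being forced
# into a cluster
iris_DB <- dbscan(iris[, 3:4], eps = 0.5, minPts = 5)
iris$cluster <- as.factor(iris_DB$cluster)
p_iris <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = cluster)) + geom_point()
print(p_iris)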
Upvotes: 1
Reputation: 77495
Yes, by the sum-of-squares objective, this point belongs to the red cluster.
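As a quick check, assuming the iris_Cluster object from the k-means code above, you can compare the squared distances from point 99 to the two centers; the objective is minimized by assigning each point to the center with the smaller value:

# Squared Euclidean distance from point 99 to each cluster center;
# k-means minimizes the sum of these over all points
rowSums(sweep(iris_Cluster$centers, 2, as.numeric(iris[99, 3:4]))^2)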
Consider, e.g., DBSCAN.
Upvotes: 1