Reputation: 35
I am using the iris dataset in R. I clustered the data using K-means; the output is the variable km.out. However, I cannot find an easy way to assign the cluster numbers (1-3) to a species (versicolor, setosa, virginica). I created a manual way to do it but I have to set the seed and it's very manual. There has to be a better way to do it. Any thoughts?
Here is what I did manually:
for (i in 1:length(km.out$cluster)) {
  if (km.out$cluster[i] == 1) {
    km.out$cluster[i] = "versicolor"
  }
}
for (i in 1:length(km.out$cluster)) {
  if (km.out$cluster[i] == 2) {
    km.out$cluster[i] = "setosa"
  }
}
for (i in 1:length(km.out$cluster)) {
  if (km.out$cluster[i] == 3) {
    km.out$cluster[i] = "virginica"
  }
}
Upvotes: 2
Views: 1962
Reputation: 707
If you want to assign the cluster numbers (1-3) to species (versicolor, setosa, virginica), you will likely not get a 1:1 correspondence. You can, however, assign the most frequent species in each cluster like this:
data(iris)
# k-means clustering
set.seed(5834)
km.out <- kmeans(iris[,1:4], centers = 3)
# associate species with clusters
(cmat <- table(Species = iris[,5], cluster = km.out$cluster))
#>             cluster
#> Species       1  2  3
#>   setosa     33 17  0
#>   versicolor  0  4 46
#>   virginica   0  0 50
# find the most-frequent species in each cluster
setNames(rownames(cmat)[apply(cmat, 2, which.max)], colnames(cmat))
#>           1           2           3
#>    "setosa"    "setosa" "virginica"
# find the most-frequent assigned cluster per species
setNames(colnames(cmat)[apply(cmat, 1, which.max)], rownames(cmat))
#>     setosa versicolor  virginica
#>        "1"        "3"        "3"
Created on 2021-09-22 by the reprex package (v2.0.1)
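To go one step further, the majority-vote lookup above can be applied to every observation so that no cluster number is hard-coded. This is only a minimal sketch: the names lookup and predicted are made up for illustration, and the seed is simply the one used above.

```r
data(iris)

# k-means clustering (seed copied from the example above; any seed works)
set.seed(5834)
km.out <- kmeans(iris[, 1:4], centers = 3)

# cross-tabulate species against cluster
cmat <- table(Species = iris$Species, cluster = km.out$cluster)

# majority species for each cluster, named by cluster id
# ('lookup' is a hypothetical helper name)
lookup <- setNames(rownames(cmat)[apply(cmat, 2, which.max)], colnames(cmat))

# label every observation by looking up its cluster id by name
predicted <- unname(lookup[as.character(km.out$cluster)])

# share of observations whose majority label equals the true species
agreement <- mean(predicted == iris$Species)
agreement
```

Note that two clusters can end up with the same majority species, so some species may never appear in predicted.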
Upvotes: 0
Reputation: 171
You can recode the cluster number and add it back to the original data with:
library(dplyr)
mutate(iris,
       cluster = case_when(km.out$cluster == 1 ~ "versicolor",
                           km.out$cluster == 2 ~ "setosa",
                           km.out$cluster == 3 ~ "virginica"))
Alternatively, you can recode the vector with a vector translation approach using elucidate::translate():
remotes::install_github("bcgov/elucidate") # if elucidate isn't installed yet
library(dplyr)
library(elucidate)
mutate(iris,
       cluster = translate(km.out$cluster,
                           old = c(1:3),
                           new = c("versicolor",
                                   "setosa",
                                   "virginica")))
Upvotes: 0
Reputation: 11046
It is not clear what you are trying to accomplish. The clusters created by kmeans will not match the Species exactly, and there is no guarantee that clusters 1, 2, 3 will match the order of the species in iris. Also, as you noted, the results will vary depending on the value of the seed. For example,
set.seed(42)
iris.km <- kmeans(scale(iris[, -5]), 3)
table(iris.km$cluster, iris$Species)
#
#     setosa versicolor virginica
#   1     50          0         0
#   2      0         39        14
#   3      0         11        36
Cluster 1 corresponds exactly to setosa, but clusters 2 and 3 each contain a mix of versicolor and virginica.
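To see how arbitrary the cluster numbering is, you can cross-tabulate the labels from two runs. The following is a rough sketch; the seeds 1 and 2, nstart = 25, and the object names x, km.a, km.b, and cross are arbitrary choices for illustration.

```r
data(iris)
x <- scale(iris[, -5])

# two runs that differ only in the seed; nstart = 25 keeps the best of
# 25 random starts, which tends to make the partition itself more stable
set.seed(1)
km.a <- kmeans(x, centers = 3, nstart = 25)
set.seed(2)
km.b <- kmeans(x, centers = 3, nstart = 25)

# rows are cluster ids from the first run, columns from the second;
# the same group of points may carry a different number in each run
cross <- table(run_a = km.a$cluster, run_b = km.b$cluster)
cross
```

If the two runs find the same partition, each row has a single non-zero cell, but that cell need not sit on the diagonal.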
Upvotes: 3
Reputation: 76402
R is a vectorized language, so the following one-liner is equivalent to the loops in the question.
km.out$cluster <- c("versicolor", "setosa", "virginica")[km.out$cluster]
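If you would rather keep the numeric cluster ids intact, the same indexing trick can write the labels to a separate column. This is just a sketch: the seed, the copy iris2, and the column name cluster_label are assumptions for illustration, and the label order is the one from the question, which only holds for the run that produced it.

```r
data(iris)

# arbitrary seed, assumed for illustration only
set.seed(123)
km.out <- kmeans(iris[, 1:4], centers = 3)

# label order copied from the question; it is only meaningful for the
# particular run (seed) in which it was worked out
labels <- c("versicolor", "setosa", "virginica")

# work on a copy and store the labels next to the data
# instead of overwriting km.out$cluster
iris2 <- iris
iris2$cluster_label <- labels[km.out$cluster]

# check how the assigned labels line up with the true species
table(assigned = iris2$cluster_label, actual = iris2$Species)
```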
Upvotes: 4