Jason

Reputation: 35

Naming clusters in R

I am using the iris dataset in R. I clustered the data with k-means and stored the result in km.out. However, I cannot find an easy way to map the cluster numbers (1-3) to species names (versicolor, setosa, virginica). I wrote a manual approach, but it requires setting the seed and hard-coding each mapping. There has to be a better way. Any thoughts?

Here is what I did manually:

for (i in 1:length(km.out$cluster)) {
  if (km.out$cluster[i] == 1) {
    km.out$cluster[i] = "versicolor"
  }
}
for (i in 1:length(km.out$cluster)) {
  if (km.out$cluster[i] == 2) {
    km.out$cluster[i] = "setosa"
  }
}
for (i in 1:length(km.out$cluster)) {
  if (km.out$cluster[i] == 3) {
    km.out$cluster[i] = "virginica"
  }
}

Upvotes: 2

Views: 1962

Answers (4)

scrameri

Reputation: 707

If you want to assign the cluster numbers (1-3) to species (versicolor, setosa, virginica), you will likely not get a 1:1 correspondence. You could, however, assign the most frequent species in each cluster like this:

data(iris)

# k-means clustering
set.seed(5834)
km.out <- kmeans(iris[,1:4], centers = 3)

# associate species with clusters
(cmat <- table(Species = iris[,5], cluster = km.out$cluster))
#>             cluster
#> Species       1  2  3
#>   setosa     33 17  0
#>   versicolor  0  4 46
#>   virginica   0  0 50

# find the most-frequent species in each cluster
setNames(rownames(cmat)[apply(cmat, 2, which.max)], colnames(cmat))
#>           1           2           3 
#>    "setosa"    "setosa" "virginica"

# find the most-frequent assigned cluster per species
setNames(colnames(cmat)[apply(cmat, 1, which.max)], rownames(cmat))
#>     setosa versicolor  virginica 
#>        "1"        "3"        "3"

Created on 2021-09-22 by the reprex package (v2.0.1)
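
As a follow-up sketch, the majority-species lookup above can be carried back to the individual rows, so that every observation gets the name of the species that dominates its cluster. The names iris_labeled, majority, and cluster_label are made up for this illustration, and the result depends on the seed, so two clusters can end up with the same species name:

```r
data(iris)

# k-means clustering (same call as above; the seed determines the result)
set.seed(5834)
km.out <- kmeans(iris[, 1:4], centers = 3)

# species-by-cluster contingency table
cmat <- table(Species = iris$Species, cluster = km.out$cluster)

# most frequent species per cluster, named by cluster id ("1", "2", "3")
majority <- setNames(rownames(cmat)[apply(cmat, 2, which.max)], colnames(cmat))

# look up the majority species for each observation's cluster
iris_labeled <- iris                      # hypothetical copy, keeps iris untouched
iris_labeled$cluster_label <- unname(majority[as.character(km.out$cluster)])

# share of observations whose cluster label agrees with the true species
mean(iris_labeled$cluster_label == iris_labeled$Species)
```

Indexing majority by as.character(km.out$cluster) looks names up rather than positions, which keeps the mapping correct even if the column order of cmat ever differs from 1, 2, 3.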

Upvotes: 0

huttoncp

Reputation: 171

You can recode the cluster number and add it back to the original data with:

library(dplyr)
mutate(iris, 
       cluster = case_when(km.out$cluster == 1 ~ "versicolor",
                           km.out$cluster == 2 ~ "setosa",
                           km.out$cluster == 3 ~ "virginica"))

Alternatively, you can recode the vector with a lookup-style translation using elucidate::translate():

remotes::install_github("bcgov/elucidate") #if elucidate isn't installed yet
library(dplyr)
library(elucidate)

mutate(iris, 
       cluster = translate(km.out$cluster, 
                           old = c(1:3), 
                           new =  c("versicolor", 
                                    "setosa", 
                                    "virginica")))

Upvotes: 0

dcarlson

Reputation: 11046

It is not clear what you are trying to accomplish. The clusters created by kmeans will not match the species exactly, and there is no guarantee that clusters 1, 2, 3 will follow the order of the species in iris. Also, as you noted, the results vary with the seed. For example:

set.seed(42)
iris.km <- kmeans(scale(iris[, -5]), 3)
table(iris.km$cluster, iris$Species)
#    
#     setosa versicolor virginica
#   1     50          0         0
#   2      0         39        14
#   3      0         11        36

Cluster 1 corresponds exactly to setosa, but clusters 2 and 3 each mix versicolor and virginica.
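
To see how arbitrary the cluster numbers are, one rough check is to run kmeans under two different seeds and cross-tabulate the two cluster assignments. The seeds and the names km_a, km_b, and tab are arbitrary choices for this sketch; nstart = 25 simply asks kmeans to try 25 random starts and keep the best one, which tends to stabilise the partition but does nothing to fix the numbering:

```r
x <- scale(iris[, -5])

# two runs that differ only in the random seed
set.seed(1)
km_a <- kmeans(x, centers = 3, nstart = 25)
set.seed(2)
km_b <- kmeans(x, centers = 3, nstart = 25)

# rows: cluster ids from run a, columns: cluster ids from run b
tab <- table(run_a = km_a$cluster, run_b = km_b$cluster)
print(tab)
```

If the large counts fall off the diagonal, the two runs found similar groups but numbered them differently, which is why a hard-coded number-to-species mapping only holds for one particular seed.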

Upvotes: 3

Rui Barradas

Reputation: 76402

R is a vectorized language, so the following one-liner is equivalent to the code in the question:

km.out$cluster <- c("versicolor", "setosa", "virginica")[km.out$cluster]
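
A small standalone illustration of the same indexing trick, using a made-up cluster vector instead of real kmeans output (the names cluster and labels are only for this sketch):

```r
# toy cluster assignments, standing in for km.out$cluster
cluster <- c(1L, 3L, 2L, 1L)

# position i of this vector holds the name for cluster i
labels <- c("versicolor", "setosa", "virginica")

# integer indexing picks one label per observation
named <- labels[cluster]
named
#> [1] "versicolor" "virginica"  "setosa"     "versicolor"
```

Because the one-liner overwrites km.out$cluster with character strings, running it a second time would index by those strings and return NA values; storing the result in a new variable, as in this sketch, avoids that.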

Upvotes: 4
