Stormy S

Reputation:

Consistent Cluster Order with Kmeans in R

This might not be possible, but Google has failed me so far so I'm hoping someone else might have some insight. Sorry if this has been asked before.

The background is, I have a database of information on different cities, so like name, population, pollution, crime, etc by year. I'm querying it to aggregate the data on a per-city basis and outputting the result to a table. That works fine.

The next step is running the kmeans() function in R on the data set to find clusters. In testing, I've found that 5 clusters is almost always a good choice according to the "elbow method".

The issue I'm having is that these clusters have distinct meanings/interpretations, so I want to tag each row in the original data set with the cluster's interpretation for that row, not the cluster number. So I don't want to identify row 2 with "cluster 5", I want to say "low population, high crime, low income".

If R would output the clusters in the same order, say having cluster 5 always equate to the cluster of cities with "low population, high crime, low income", that would work fine, but it doesn't. For instance, if you run code like this:

a = kmeans(city_date, centers = 5)
b = kmeans(city_date, centers = 5)
c = kmeans(city_date, centers = 5)

Then run this code:

a$centers
b$centers
c$centers

The clusters will all contain the same rows, but the cluster numbers will differ. So if I have a mapping table in SQL that has cluster number and interpretation, it won't work, because when I run it one day it might have the "low population, high crime, low income" cluster as 5, the next day as 2, the next as 4, etc.

What I'm trying to figure out is if there is a way to keep the output consistent. The data set gets updated so it won't even be the same every time, and since R doesn't keep the cluster order consistent even with the same data set, I am wondering if it will be possible at all.

Thanks for any help anyone can provide. My current idea is to output the $centers data to a SQL table, rank the centers on each metric, tag each center as high or low on that metric, and then concatenate the tags into a label for the cluster. This may work but isn't very elegant.
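The tagging idea above can also be sketched directly in R rather than SQL. This is only a minimal illustration under stated assumptions: `label_clusters`, the `cities` data frame, and the "above or below the mean of all centers" threshold are hypothetical choices, not something from the original setup. With 5 clusters, a two-level threshold can give two clusters the same label, so finer levels may be needed.

```r
# Build a text label for each cluster from its center.
# Assumption: a metric is "high" when the cluster center lies above the
# mean of all cluster centers for that metric, otherwise "low".
label_clusters <- function(km) {
  ctr <- km$centers
  is_high <- sweep(ctr, 2, colMeans(ctr), ">")
  apply(is_high, 1, function(row) {
    paste(ifelse(row, "high", "low"), colnames(ctr), collapse = ", ")
  })
}

set.seed(7)
# Hypothetical stand-in for the aggregated per-city table
cities <- data.frame(
  population = c(rnorm(20, 10, 1), rnorm(20, 100, 5)),
  crime      = c(rnorm(20, 50, 2), rnorm(20, 5, 1))
)

km <- kmeans(cities, centers = 2, nstart = 10)

# Tag each row with its cluster's label instead of its cluster number
cities$label <- unname(label_clusters(km)[km$cluster])
table(cities$label)
```

Because the label is derived from the center itself, it does not depend on which number kmeans() happens to assign to a cluster on a given run.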

Upvotes: 9

Views: 8035

Answers (5)

jakub

Reputation: 5104

This function runs kmeans on 1-dimensional input and returns a normal "kmeans" object whose clusters are numbered by the order of their centers, without having to run kmeans twice.

ordered_kmeans = function(x, centers, iter.max = 10, nstart = 1,
                          algorithm = c("Hartigan-Wong", "Lloyd", "Forgy",
                                        "MacQueen"), 
                          trace = FALSE,
                          desc = TRUE) {

  if (NCOL(x) > 1) {
    stop("only one-dimensional inputs are allowed")
  }
  
  k = kmeans(x = x, centers = centers, iter.max = iter.max, nstart = nstart,
             algorithm = algorithm, trace = trace)
  
  centers_ind = order(k$centers, decreasing = desc)
  
  centers_ord = setNames(seq_along(k$centers), nm = centers_ind)
  
  k$cluster  = unname(centers_ord[as.character(k$cluster)])
  k$centers  = matrix(k$centers[centers_ind], ncol = 1)
  k$withinss = k$withinss[centers_ind]
  k$size     = k$size[centers_ind]
 
  k
}

Example usage:

vec = c(20.28, 9.49, 7.14, 2.48, 2.36, 1.82, 1.3, 1.26, 1.11, 0.98, 
        0.81, 0.73, 0.66, 0.63, 0.57, 0.53, 0.44, 0.42, 0.38, 0.37, 0.33, 
        0.29, 0.28, 0.27, 0.26, 0.23, 0.23, 0.2, 0.18, 0.16, 0.15, 0.14, 
        0.14, 0.12, 0.11, 0.1, 0.1, 0.08)

# For comparison
set.seed(1)
k = kmeans(vec, centers = 3); k

set.seed(1)
k = ordered_kmeans(vec, centers = 3); k

set.seed(1)
k = ordered_kmeans(vec, centers = 3, desc = FALSE); k

Upvotes: 2

ndimhypervol

Reputation: 509

Here's an example that assigns letter factor levels to the k-means clusters, ordered from A (lowest center) to C (highest center). The parameters can be altered to fit the data you have.

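df <- data.frame(id = 1:10, obs = sample(0:500, 10))
km <- kmeans(df$obs, centers = 3)
# cluster numbers, sorted by ascending cluster center
km.order <- as.numeric(names(sort(km$centers[,1])))
# label them A (lowest center) to C (highest center)
names(km.order) <- toupper(letters)[1:3]
# reorder by cluster number so that position i holds cluster i
km.order <- sort(km.order)
# look up the letter for each observation's cluster
clus.order <- factor(names(km.order[km$cluster]))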

Upvotes: 0

RogB

Reputation: 461

I know this is a very old post, but I only came across it now. I had the same problem today and adapted the suggestion by Barker to come up with a solution:

# create a random data frame
df <- data.frame(id = 1:10, obs = sample(0:500, 10))

# use kmeans a first time to get the centers
centers <- kmeans(df$obs, centers = 3)$centers

# order the centers
centers <- sort(centers)

# call kmeans again but this time passing the centers calculated in the previous step
clusteridx <- kmeans(df$obs, centers = centers)$cluster

Not very elegant, but it works. Because the second call starts from centers that are already converged and sorted, the clusteridx vector numbers the clusters by their centers in ascending order.

This can also be collapsed into just one line if you prefer:

clusteridx <- kmeans(df$obs, centers = sort(kmeans(df$obs, centers = 3)$centers))$cluster

Upvotes: 11

Barker

Reputation: 2094

I haven't done this myself so I am not sure it will work, but kmeans has the parameter:

  • centers - either the number of clusters, say k, or a set of initial (distinct) cluster centres. If a number, a random set of (distinct) rows in x is chosen as the initial centres.

If you know roughly where the clusters should be (perhaps by getting the cluster centers from a dataset you are matching to), you could use that to initialize the model. That would make the starting locations non-random, so the clusters should stay in the same order. As an added benefit, initializing the cluster centers close to where they will end up should speed up your clustering.

Edit

I just checked using the data from the kmeans example, initializing the first center at (1,1) and the second at (0,0) (the means of the distributions used to make the clusters), as below.

x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
(cl <- kmeans(x, matrix(c(1,0,1,0),ncol=2)))
plot(x, col = cl$cluster)
points(cl$centers, col = 1:2, pch = 8, cex = 2)

After repeated runs, I found that the first cluster was always in the top right and the second in the bottom left, whereas passing just the number of clusters (centers = 2) caused them to switch back and forth. If you have approximate starting values for your clusters (i.e. a quantification of "low population, high crime, low income"), you could use them as the initialization and get the results you want.
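For a data set that gets refreshed, one way to apply this might be to save the centers from a reference run and pass them as the initial centers for the updated data. The sketch below assumes the updated data has roughly the same cluster structure as the old data. The `old` and `new` matrices are simulated stand-ins, and the ordering is not guaranteed to survive if the clusters move a lot between updates.

```r
set.seed(1)
# Hypothetical "old" data: two groups around (0, 0) and (1, 1)
old <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
# Hypothetical "updated" data drawn from the same two groups
new <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))

# Reference run: its centers would be stored (e.g. in a table) for later use
ref <- kmeans(old, centers = 2, nstart = 10)

# Later run on the updated data, seeded with the stored centers
upd <- kmeans(new, centers = ref$centers)

ref$centers
upd$centers
```

Row i of upd$centers should then describe the same kind of cluster as row i of ref$centers, so a mapping table keyed on the cluster number of the reference run stays usable.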

Upvotes: 1

piotrpo

Reputation: 12636

Usually k-means is initialized randomly a few times to avoid local minima. If you want the resulting clusters ordered, you have to order them manually after the k-means algorithm finishes.
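A minimal sketch of such a manual reordering for multi-column data, assuming it is acceptable to order the clusters by a single column of the centers. The helper name `reorder_kmeans` and its `by` argument are hypothetical, and only the cluster, centers, withinss and size components are rearranged.

```r
# Renumber the clusters of a fitted kmeans object so that cluster 1 has the
# smallest center in column `by`, cluster 2 the next smallest, and so on.
reorder_kmeans <- function(km, by = 1) {
  ord <- order(km$centers[, by])      # old cluster ids, sorted by center
  new_id <- integer(length(ord))
  new_id[ord] <- seq_along(ord)       # lookup: old id -> new id
  km$cluster  <- new_id[km$cluster]
  km$centers  <- km$centers[ord, , drop = FALSE]
  rownames(km$centers) <- seq_along(ord)
  km$withinss <- km$withinss[ord]
  km$size     <- km$size[ord]
  km
}

set.seed(42)
# Simulated data with three well-separated groups
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 6, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")

# nstart = 10 uses several random starts; the numbering is fixed afterwards
km <- reorder_kmeans(kmeans(x, centers = 3, nstart = 10), by = "x")
km$centers
```

The random restarts still guard against poor local minima, while the numbering depends only on the final centers.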

Upvotes: 1
