Reputation: 29

The result of kmeans() does not vary from run to run

I am trying to run kmeans several times in order to see the different values that totss takes. But when I run the following code, I get exactly the same result all 50 times (n = 50).

n= 50
k=1
for (i in c(1:n)){

   set.seed(as.numeric(runif(1))) #random seed

   a <- kmeans(na.omit(data[,c(8,22,23,28)]), centers=2)
   print(a$iter)
   print(a$totss)
   print(a$size)
   print(a$centers)

   k=k+1
   remove(a)
}

Result

totss      size1   size2
64366.21   14080   13061
64366.21   14080   13061
64366.21   14080   13061
64366.21   14080   13061
...

Any idea why this is happening?

Picture: I removed the set.seed() call and printed a$iter (the number of iterations).

Upvotes: 0

Answers (2)

Has QUIT--Anony-Mousse

Reputation: 77495

If the data is too extreme, there may be only a single optimum.

In the part of the data that you showed, the first column is constant (so it does not matter), and the last one is too low in magnitude to matter. The other two take just two values each. So kmeans is almost certain to find this trivial binary split.

So the problem is your data.
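A quick way to check whether the columns carry enough variation is to count distinct values and compare spreads. The sketch below uses a made-up data frame `toy` (hypothetical values, not the asker's data) that mimics the situation described above: one constant column, two binary columns, and one column of very small magnitude.

```r
# Toy data frame standing in for the four columns used in the question
# (hypothetical values chosen only for illustration)
set.seed(42)
toy <- data.frame(
  constant = rep(5, 200),              # never changes
  binary_a = rbinom(200, 1, 0.5),      # only 0 or 1
  binary_b = rbinom(200, 1, 0.5),      # only 0 or 1
  tiny     = rnorm(200, sd = 1e-4)     # negligible magnitude
)

# Number of distinct values and standard deviation per column
n_distinct <- sapply(toy, function(col) length(unique(col)))
col_sd     <- sapply(toy, sd)
print(rbind(n_distinct, col_sd))
```

A column with a single distinct value contributes nothing to the distances, and a column whose standard deviation is orders of magnitude smaller than the others is effectively ignored. Running the same two `sapply()` calls on `na.omit(data[, c(8, 22, 23, 28)])` would show whether the real data look like this.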

Upvotes: 0

Zheyuan Li

Reputation: 73385

set.seed(runif(1)) always gives you set.seed(0). You can try set.seed(i) instead.

You can also just use a single set.seed outside the loop.
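The truncation is easy to verify. This is a minimal sketch that relies on set.seed() interpreting its argument as an integer; the variable names are only for illustration.

```r
# runif(1) lies strictly between 0 and 1, so coercing it to integer gives 0
u <- runif(1)
as.integer(u)

# Seeding with such a value therefore reproduces the stream of set.seed(0)
set.seed(0)
a <- rnorm(3)
set.seed(runif(1))
b <- rnorm(3)
identical(a, b)    # TRUE

# Seeding with the loop index gives a different stream per iteration
set.seed(1); x1 <- rnorm(3)
set.seed(2); x2 <- rnorm(3)
identical(x1, x2)  # FALSE
```

Note that a varying seed only changes the random starting centers; it does not guarantee that kmeans converges to a different solution.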

I changed runif(1) to runif(1) * 100 but still got the same output for every run.

I added set.seed() because if I drop it, the loop gives me the same result for all iterations.

I understand your point, but the problem is that something is wrong because I am getting the same results in each run / iteration.

Who says that kmeans always gives random results? It depends on what your data look like. The following example has two unambiguous clusters, so kmeans exhibits no randomness.

set.seed(0)
X <- rbind(matrix(rnorm(100), 50), matrix(rnorm(100, 10), 50))
plot(X)

## 50 runs
cl <- replicate(50, kmeans(X, 2), simplify = FALSE)

## size[1]
sapply(cl, "[[", c(7, 1))
# [1] 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50
#[26] 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50

## size[2]
sapply(cl, "[[", c(7, 2))
# [1] 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50
#[26] 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50

## iter
sapply(cl, "[[", 8)
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#[39] 1 1 1 1 1 1 1 1 1 1 1 1

The centers for two clusters are invariant, up to labeling. Sometimes the lower left cluster in the figure is seen as the first cluster, while sometimes the upper right cluster is seen as the first cluster.

## center
ctr <- lapply(cl, "[[", 2)
unique(ctr)
#[[1]]
#        [,1]        [,2]
#1 0.02393097  0.02140593    ## lower left cluster is the 1st cluster
#2 9.78910937 10.11978752
#
#[[2]]
#        [,1]        [,2]
#1 9.78910937 10.11978752    ## upper right cluster is the 1st cluster
#2 0.02393097  0.02140593

If you want to see some uncertainty, try some "ambiguous" data:

X <- matrix(runif(200), 100)
plot(X)

If you ask for 2 clusters from this dataset, kmeans can potentially give a different result on each run. If you ask for 3 clusters, the result is more uncertain.

Remark

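Don't compare totss from run to run: it is the total sum of squares of the data about the overall mean, so it does not depend on the cluster centers and is fixed for a given dataset. Compare withinss or tot.withinss instead, which are sensitive to the positions of the centers.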
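As a rough illustration of this remark, the sketch below repeats kmeans on uniform "ambiguous" data and collects both quantities. The seed, the number of runs and the variable names are arbitrary choices for the example.

```r
# Ambiguous data: 100 points spread uniformly over the unit square
set.seed(1)
X <- matrix(runif(200), 100)

# 50 independent runs with 3 clusters, each from random starting centers
runs <- replicate(50, kmeans(X, 3), simplify = FALSE)

# totss depends only on the data, so it is identical in every run
totss <- sapply(runs, function(r) r$totss)
length(unique(totss))

# tot.withinss depends on the fitted centers, so it may differ between runs
tot_within <- sapply(runs, function(r) r$tot.withinss)
summary(tot_within)
table(round(tot_within, 4))
```

If `table()` shows more than one value, the runs ended in different local optima. If it shows a single value, every run converged to the same partition, up to labeling.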

Upvotes: 4
