Reputation: 29
I am trying to make several kmeans runs, in order to see the different values that totss
get. But when I run the following code, I get the same exact result 50 times (n=50).
n= 50
k=1
for (i in c(1:n)){
set.seed(as.numeric(runif(1))) #random seed
a <- kmeans(na.omit(data[,c(8,22,23,28)]), centers=2)
print(a$iter)
print(a$totss)
print(a$size)
print(a$centers)
k=k+1
remove(a)
}
Result
*totss *size1 *size2
64366.21 14080 13061
64366.21 14080 13061
64366.21 14080 13061
64366.21 14080 13061
...
Any idea why this is happening?
Picture: I deleted the set.seed() thing, and printed the a$iter
(number of iterations).
Upvotes: 0
Views: 418
Reputation: 77495
If the data is too extreme, then there may be only a single optimum.
In the part of the data that you showed, the first column is constant (so it does not matter), the last one is too low in magnitude to matter, and the other two take just two values. So k-means is almost certain to find this trivial binary split.
So the problem is your data.
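A quick way to check for this kind of structure is to look at the spread and the number of distinct values in each column before clustering. The sketch below uses a made-up data frame (`toy` and its column names are hypothetical stand-ins, not the asker's data) with the same features described above: one constant column, two two-valued columns, and one column of negligible magnitude.

```r
# Hypothetical stand-in for the four columns passed to kmeans()
set.seed(42)
toy <- data.frame(
  const  = rep(5, 200),                 # constant: contributes nothing to distances
  flag_a = rep(c(0, 100), each = 100),  # only two distinct values
  flag_b = rep(c(0, 100), each = 100),  # only two distinct values
  tiny   = runif(200, 0, 0.01)          # magnitude too small to influence distances
)

col_sd     <- sapply(toy, sd)                                # spread per column
col_unique <- sapply(toy, function(col) length(unique(col))) # distinct values per column

print(col_sd)
print(col_unique)
```

Columns with zero spread, or a spread orders of magnitude below the others, have essentially no effect on the Euclidean distances that k-means uses.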
Upvotes: 0
Reputation: 73385
set.seed(runif(1)) always gives you set.seed(0), because the seed is interpreted as an integer and runif(1) lies between 0 and 1. You can try set.seed(i) instead.
You can also just use a single set.seed outside the loop.
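The seed behaviour can be checked directly. This is a minimal sketch; the variable names are only for illustration.

```r
# set.seed() interprets its argument as an integer,
# so a value strictly between 0 and 1 is truncated to 0.
truncated <- as.integer(0.73)

# Seeding with runif(1) therefore reproduces the stream of set.seed(0).
set.seed(runif(1)); draw_a <- runif(3)
set.seed(0);        draw_b <- runif(3)
same_stream <- identical(draw_a, draw_b)

# Using the loop index gives a distinct, reproducible seed per run.
draws <- sapply(1:5, function(i) { set.seed(i); runif(1) })

print(truncated)
print(same_stream)
print(draws)
```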
I changed runif(1) to runif(1) * 100 but still got the same output for every run.
I added set.seed() because if I drop it, the loop gives me the same result for all iterations.
I understand your point, but the problem is that something is wrong because I am getting the same results in each run / iteration.
Who tells you that kmeans always gives random results? It depends on what your data look like. The following example has unambiguously two clusters, so kmeans would exhibit no randomness.
set.seed(0)
X <- rbind(matrix(rnorm(100), 50), matrix(rnorm(100, 10), 50))
plot(X)
## 50 runs
cl <- replicate(50, kmeans(X, 2), FALSE)
## size[1]
sapply(cl, "[[", c(7, 1))
# [1] 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50
#[26] 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50
## size[2]
sapply(cl, "[[", c(7, 2))
# [1] 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50
#[26] 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50
## iter
sapply(cl, "[[", 8)
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#[39] 1 1 1 1 1 1 1 1 1 1 1 1
The centers for two clusters are invariant, up to labeling. Sometimes the lower left cluster in the figure is seen as the first cluster, while sometimes the upper right cluster is seen as the first cluster.
## center
ctr <- lapply(cl, "[[", 2)
unique(ctr)
#[[1]]
# [,1] [,2]
#1 0.02393097 0.02140593 ## lower left cluster is the 1st cluster
#2 9.78910937 10.11978752
#
#[[2]]
# [,1] [,2]
#1 9.78910937 10.11978752 ## upper right cluster is the 1st cluster
#2 0.02393097 0.02140593
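To compare centers across runs without being misled by the arbitrary labels, one option is to put the rows of each centers matrix into a fixed order first. The sketch below sorts by the first coordinate; that ordering rule and the helper name `canonical_centers` are assumptions made for illustration, and the data are regenerated so the block runs on its own.

```r
set.seed(0)
X  <- rbind(matrix(rnorm(100), 50), matrix(rnorm(100, 10), 50))
cl <- replicate(50, kmeans(X, 2), simplify = FALSE)

# Hypothetical helper: order the cluster centers by their first coordinate
# and drop the row labels, so that labeling no longer matters.
canonical_centers <- function(fit) {
  ctr <- fit$centers
  ctr <- ctr[order(ctr[, 1]), , drop = FALSE]
  rownames(ctr) <- NULL
  ctr
}

canon <- lapply(cl, canonical_centers)

# TRUE if every run agrees with the first run up to numerical tolerance
all_agree <- all(sapply(canon, function(m) isTRUE(all.equal(m, canon[[1]]))))
print(all_agree)
```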
If you want to see some uncertainty, try some "ambiguous" data:
X <- matrix(runif(200), 100)
plot(X)
If you ask for 2 clusters from this dataset, kmeans can potentially give a different result on each run. If you ask for 3 clusters, the result is more uncertain.
Remark
Don't compare totss from run to run, as it is fixed: it depends only on the data, not on the clustering. Compare withinss or tot.withinss instead, which are sensitive to the positions of the centers.
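A small check of this remark, as a sketch on the "ambiguous" uniform data with 3 clusters: totss equals the sum of squared deviations from the column means, so it is identical in every run, whereas tot.withinss depends on where the centers end up.

```r
set.seed(1)
X    <- matrix(runif(200), 100)
fits <- replicate(50, kmeans(X, centers = 3), simplify = FALSE)

totss      <- sapply(fits, function(f) f$totss)
tot_within <- sapply(fits, function(f) f$tot.withinss)

# totss computed by hand: squared deviations from the overall column means
totss_manual <- sum(scale(X, scale = FALSE)^2)

print(unique(totss))      # a single value, the same for every run
print(totss_manual)
print(range(tot_within))  # may differ from run to run
```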
Upvotes: 4