Ideal number of clusters in Weka K-means

Question

I am using Weka's SimpleKMeans to cluster 96000 terms (words). Weka takes the desired number of clusters as a parameter, and the default is 2. My dataset is a sparse 96000x641000 matrix. Initially I set the number of clusters to 10000, but I think that is too many for the recommendation process. Is there an algorithmic way to calculate the number of clusters, or to find the ideal number of clusters?

Has QUIT--Anony-Mousse · Accepted Answer

K-means is not really designed for sparse data. It is also designed for Euclidean distance, and you should be aware that Euclidean distance is not a good choice for high-dimensional data.

Maybe the simplest argument is as follows: the mean of a subset will likely no longer be sparse, so it will be anomalous itself, and closer to the center of the whole data set than the actual data instances are. This in turn means that the means of different clusters will likely be closer to each other than the actual instances are to their own means, which makes the result highly dubious.
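A small numeric sketch of this argument, using NumPy on made-up random data (the sizes, the binary values, and the arbitrary two-way split are all assumptions for illustration, not the dataset from the question and not a real k-means run):

```python
import numpy as np

def sparse_rows(n_rows, n_cols, nnz_per_row, rng):
    """Build binary rows with exactly nnz_per_row non-zero entries each."""
    X = np.zeros((n_rows, n_cols))
    for row in X:  # each row is a view, so this writes into X
        idx = rng.choice(n_cols, size=nnz_per_row, replace=False)
        row[idx] = 1.0
    return X

rng = np.random.default_rng(0)
X = sparse_rows(200, 1000, 10, rng)

# Arbitrary split into two groups, standing in for two clusters.
labels = np.arange(200) % 2
means = np.stack([X[labels == k].mean(axis=0) for k in (0, 1)])

# Fraction of non-zero entries in the data versus in the means.
density_data = np.count_nonzero(X) / X.size
density_means = np.count_nonzero(means) / means.size

# Average Euclidean distance from each row to the mean of its own group,
# compared with the distance between the two group means.
dist_to_own_mean = np.linalg.norm(X - means[labels], axis=1).mean()
dist_between_means = np.linalg.norm(means[0] - means[1])

print(f"density of data:  {density_data:.3f}")
print(f"density of means: {density_means:.3f}")
print(f"mean distance to own mean: {dist_to_own_mean:.3f}")
print(f"distance between means:    {dist_between_means:.3f}")
```

On data like this the means come out far denser than any single row, and the two means sit much closer to each other than the rows sit to their own mean. Real term data has more structure than uniform noise, so the gap may be smaller, but the effect points in the same direction.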

You should at least try k-medians instead (although it is a lot slower), or take other measures that keep the cluster centers sparse as well. Sure, k-means does cluster the data. The question is how valid the result is.
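To see why a median-based center helps with sparsity, here is a minimal sketch comparing the coordinate-wise mean and median of one group of sparse rows. The data is synthetic and hypothetical: every row shares three "core" columns and adds ten random noise columns. It only shows the center computation, not a full k-medians loop.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols = 100, 1000
core_cols = [0, 1, 2]  # features shared by every row in this group (assumed)

X = np.zeros((n_rows, n_cols))
for row in X:  # each row is a view, so this writes into X
    row[core_cols] = 1.0
    # Ten random noise features drawn from the remaining columns.
    noise = rng.choice(np.arange(3, n_cols), size=10, replace=False)
    row[noise] = 1.0

mean_center = X.mean(axis=0)
median_center = np.median(X, axis=0)

print("non-zeros in mean center:  ", np.count_nonzero(mean_center))
print("non-zeros in median center:", np.count_nonzero(median_center))
```

The mean is non-zero wherever any single row is non-zero, so it picks up every noise column. The median is non-zero only where at least half of the rows are non-zero, so here it keeps the shared columns and stays sparse.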
