JoshuaJeanThree

Reputation: 1392

Ideal number of clusters in Weka K-means

I am using Weka's SimpleKMeans to cluster 96000 terms (words). Weka takes the desired number of clusters as a parameter, and the default is 2. My dataset is a sparse 96000x641000 matrix. Initially I set the number of clusters to 10000, but I think that is too many for the recommendation process. Is there an algorithm for calculating the number of clusters, or some other way to find the ideal number?

Upvotes: 1

Views: 3788

Answers (2)

Has QUIT--Anony-Mousse

Reputation: 77454

K-means is not really designed for sparse data. Plus, it is designed for Euclidean distance, and you should be aware that Euclidean distance is not a good choice for high-dimensional data.

Maybe the simplest argument is as follows: The mean of a subset will likely no longer be sparse, so it will be anomalous itself, and closer to the center than the actual data instances. This however means that the means of different clusters will likely be closer to each other than the actual instances to their means, which makes the result highly dubious.
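The argument above can be checked on toy data. The following is only a minimal sketch, assuming random binary vectors stored as index-to-value dicts; the sizes, the seed, and the helper names (`random_sparse_vector`, `mean_vector`, `norm`) are made up for illustration and have nothing to do with Weka's implementation.

```python
import math
import random

def random_sparse_vector(dim, nnz, rng):
    """Return a sparse binary vector as an {index: value} dict with nnz ones."""
    return {i: 1.0 for i in rng.sample(range(dim), nnz)}

def mean_vector(vectors):
    """Component-wise mean of sparse vectors, also stored as a dict."""
    total = {}
    for v in vectors:
        for i, x in v.items():
            total[i] = total.get(i, 0.0) + x
    return {i: x / len(vectors) for i, x in total.items()}

def norm(v):
    """Euclidean length of a sparse vector."""
    return math.sqrt(sum(x * x for x in v.values()))

rng = random.Random(0)           # fixed seed, chosen arbitrarily
dim, nnz, size = 10_000, 20, 50  # toy sizes, not the 96000x641000 dataset
cluster = [random_sparse_vector(dim, nnz, rng) for _ in range(size)]
center = mean_vector(cluster)

print("nonzeros per instance:", nnz)
print("nonzeros in the mean: ", len(center))
print("norm of an instance:  ", round(norm(cluster[0]), 3))
print("norm of the mean:     ", round(norm(center), 3))
```

The mean has far more nonzero entries than any single instance, and it is much shorter than the instances, so it sits closer to the origin than the data it is supposed to represent.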

You should at least try k-medians instead (though it is a lot slower), or other measures that preserve sparsity for the means, too. Sure, k-means does cluster the data. The question is how valid the result is.

See also:

k-means clustering in R on very large, sparse matrix?

Clustering of sparse matrix in python and scipy

Distance Metric for clustering elements in a sparse matrix

clustering on very large sparse matrix?

K-means clustering algorithm run time and complexity

How to do K-means with normalized TF-IDF

Mahout binary data clustering

The questions linked above are a number of failure stories (= questions without a good answer) of running k-means on high-dimensional sparse / binary data.

Upvotes: 1

Abinash Koirala

Reputation: 987

For k-means and its variants there is a rule of thumb for an initial estimate of k. Generally it is suitable to take k = (n / 2) ^ 0.5, where n is the number of data points.
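As a quick illustration, here is the rule applied to the 96000 terms from the question. The function name `rule_of_thumb_k` is hypothetical, and rounding to the nearest integer is an assumption, since the rule itself does not say how to round.

```python
import math

def rule_of_thumb_k(n):
    """Initial guess k = sqrt(n / 2) for n data points, rounded to an integer."""
    return round(math.sqrt(n / 2))

print(rule_of_thumb_k(96_000))  # 219
```

That gives a starting point of roughly 219 clusters, far below the 10000 tried in the question. It is only an initial guess to refine, not a validated optimum.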

Upvotes: 1
