Bee Smears

Reputation: 803

Optimizing K (ideal # of clusters) Using PyCluster

I'm using PyCluster's k-means to cluster some data, largely because SciPy's kmeans2() produced an insuperable error (mentioned here). Anyhow, PyCluster's k-means worked well, and I am now attempting to optimize the number of k-means clusters. PyCluster's accompanying literature suggests that I can optimize its k-means by implementing an EM algorithm (bottom of page 13 here), but I cannot find a single example.

Can someone please point me to an example of this kind of k-means optimization with PyCluster? Thanks in advance for any help.

Upvotes: 2

Views: 2948

Answers (1)

David Ding

Reputation: 201

The PyCluster manual refers to a different optimization problem than the one you are asking about. You ask how to determine the optimal number of clusters, while the manual deals with how to find the optimal clustering given the number of clusters. The key concept is that k-means, which is a type of EM (Expectation-Maximization) algorithm, does not guarantee a globally optimal solution, where an optimal clustering can be defined as an assignment of points to clusters that minimizes the sum of squared distances between each data point and the mean of its cluster. The way k-means works is this (written here as a runnable NumPy sketch of the standard algorithm):

import numpy as np

def kmeans(points, k, n_iter=100):
    # initialization: set cluster means to k randomly chosen data points
    means = points[np.random.choice(len(points), k, replace=False)]
    for _ in range(n_iter):
        # expectation step: assign each point to the cluster whose mean it is closest to
        labels = np.linalg.norm(points[:, None] - means[None, :], axis=2).argmin(axis=1)
        # maximization step: set each cluster mean to the average of the points
        # assigned to it (keeping the old mean if a cluster is left empty)
        new_means = np.array([points[labels == i].mean(axis=0) if (labels == i).any()
                              else means[i] for i in range(k)])
        if np.allclose(means, new_means):  # converged
            return new_means, labels
        means = new_means
    return means, labels

The k-means algorithm will output the best solution given its initialization, but it will not necessarily find the best clustering solution globally. This is what the manual is referring to at the bottom of page 13: the kcluster routine will perform EM (which here is exactly the k-means algorithm) a number of times, each from a different random initialization, and select the best clustering found. It never refers to the problem of finding the optimal number of clusters.
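
For reference, here is a minimal sketch of that usage. It assumes your data is already a NumPy array (call it data, of shape n_points x n_features); nclusters=5 is an arbitrary choice for illustration:

import Pycluster

# data: assumed to be an (n_points, n_features) NumPy array
# run k-means 20 times from random initializations and keep the best solution;
# error is the within-cluster sum of distances for that solution, and nfound
# is how many of the 20 passes found it
clusterid, error, nfound = Pycluster.kcluster(data, nclusters=5, npass=20)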

That said, there are a few heuristics you can use to determine the optimal number of clusters (see for instance Wikipedia):

  1. Perhaps the simplest is the rule of thumb k = sqrt(n/2) (with n the number of data points), which often works reasonably well in practice.
  2. Another approach is to divide your data into two parts: a training set (say the first 90% of the data) and a test set (the last 10%). Both sets should be representative of the entire data set, so you may want to apply random.shuffle or random.sample beforehand. Run k-means on the training set only, and compute the mean of each resulting cluster. Then, on the test set, calculate the sum of the squared distances between each test point and the mean of its nearest cluster. If you plot this test error against the number of clusters, you will (perhaps) find that after a certain value of k the error starts increasing, or at least stops decreasing; choose the k at which this happens. The test set helps guarantee that the clustering produced in training is representative of the actual data set rather than of the particular training sample you happened to draw: with n training points and n clusters you can of course obtain a perfect clustering of the training set, but the error on the test set might still be large. A sketch of this procedure appears after this list.
  3. Or you can try the more general mixture-of-Gaussians model. In a mixture of Gaussians there are k Gaussian distributions N_1, ..., N_k appearing with weights c_1, ..., c_k, where c_1 + ... + c_k = 1, and a data point is drawn from Gaussian N_i with probability c_i. k-means is a special case of this model in which each Gaussian is assumed to be spherical with equal covariances and all weights are equal. One advantage of the general model is that if some of the c_i turn out to be very small, the corresponding Gaussian hump is probably not a real cluster. To reduce complexity (and the risk of over-fitting), you can constrain the Gaussians to be spherical or to have equal covariances, which gives you a clustering mechanism that behaves much like k-means except that it also shows how important each cluster is. A second sketch after this list illustrates this approach.
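
To make the second heuristic concrete, here is a minimal sketch of the hold-out procedure, under a couple of assumptions: data is an (n_points, n_features) NumPy array, and PyCluster's clustercentroids routine is used to recover the cluster means from a cluster assignment:

import numpy as np
import Pycluster

def holdout_error(data, k, train_frac=0.9, npass=10):
    # shuffle so both halves are representative, then split into train/test
    data = data[np.random.permutation(len(data))]
    split = int(train_frac * len(data))
    train, test = data[:split], data[split:]
    # cluster the training set and recover the mean of each cluster
    clusterid, _, _ = Pycluster.kcluster(train, nclusters=k, npass=npass)
    means, _ = Pycluster.clustercentroids(train, clusterid=clusterid)
    # test error: squared distance from each test point to its nearest mean
    sq_dists = ((test[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return sq_dists.min(axis=1).sum()

# data: assumed to be an (n_points, n_features) NumPy array;
# plot errors against k and pick the k where the error stops decreasing
errors = [holdout_error(data, k) for k in range(1, 16)]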
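
For the third heuristic, here is a sketch of fitting a mixture of Gaussians. This steps outside PyCluster: it assumes scikit-learn is available, and again that data is an (n_points, n_features) NumPy array. Constraining covariance_type to 'spherical' gives the near-k-means behaviour described above, and the fitted weights show how important each component is:

from sklearn.mixture import GaussianMixture

# data: assumed to be an (n_points, n_features) NumPy array;
# spherical covariances make each component behave like a k-means cluster
gmm = GaussianMixture(n_components=5, covariance_type='spherical').fit(data)
print(gmm.weights_)  # the c_i; a component with a tiny weight may not be a real cluster
print(gmm.bic(data))  # information criteria such as BIC can also be compared across k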

Upvotes: 7
