Reputation: 803
I'm using PyCluster's kMeans to cluster some data -- largely because SciPy's kMeans2() produced an insuperable error. Mentioned here. Anyhow the PyCluster kMeans worked well, and I am now attempting to optimize the number of kMeans clusters. PyCluster's accompanying literature suggests that I can optimize its kMeans by implementing an EM algorithm -- bottom of page 13 here -- but I cannot find a single example.
Can someone please point me to a PyCluster k-means optimization problem? Thanks in advance for any help.
Upvotes: 2
Views: 2948
Reputation: 201
The manual for PyCluster refers to a different optimization problem than the one you are asking about. While you ask how to determine the optimal number of clusters, the manual deals with how to find the optimal clusters given the total number of clusters. The concept to understand is that k-means, which is a type of an EM (Expectation Maximization problem) algorithm, does not guarantee an optimal clustering solution (where an optimal clustering solution can be defined as an assignment of clusters that minimizes the sum of the square of the distances between each data-point and the mean of its cluster). The way k-means works is this:
set cluster means to equal k randomly generated points
while not converged:
# expectation step:
for each point:
assign it to its expected cluster (cluster whose mean it is closest to)
# maximization step:
for each cluster:
# maximizes likelihood for cluster mean
set cluster mean to be the average of all points assigned to it
The k-means algorithm will output the best solution given the initialization, but it will not necessarily find the best clustering solution globally. This is what the manual is referring to on the bottom of page 13. The manual says that the kcluster routine will perform EM (which is exactly the k-means algorithm) a number of times and select the optimal clustering. It never refered to the problem of finding the optimal number of clusters.
That said, there are a few heuristics you can use to determine the optimal number of clusters (see for instance Wikipedia):
Upvotes: 7