Reputation: 5723
I see that for k-means, we have Lloyd's algorithm, Elkan's algorithm, and we also have hierarchical version of k-means.
For all these algorithms, I see that Elkan's algorithm could provide a boost in term of speed. But what I want to know, is the quality from all these k-means algorithms. Each time, we run these algorithms, the result would be different, due to their heuristic and probabilistic nature. Now, my question is, when it comes to clustering algorithm like k-means, if we want to have a better quality result (as in lesser distortion, etc.) between all these k-means algorithms, which algorithm would be able to give you the better quality? Is it possible to measure such thing?
Upvotes: 4
Views: 575
Reputation: 556
To compare the quality, you should have a labeled dataset and measure the results by some criteria like NMI
Upvotes: 0
Reputation: 14309
How about the pathological case of the two-moons dataset? unsupervised k-means will fail badly. A high quality method I am aware of employs a more probabilistic approach using mutual information and combinatorial optimization. Basically you cast the clustering problem as the problem of finding the optimal [cluster] subset of the full point-set for the case of two clusters.
You can find the relevant paper here (page 42) and the corresponding Matlab code here to play with (checkout the two-moons case). If you are interested in a C++ high-performance implementation of that with a speed up of >30x then you can find it here HPSFO.
Upvotes: 1
Reputation: 178411
A better solution is usually one that has a better (lower) J(x,c)
value, where:
J(x,c) = 1/|x| * Sum(distance(x(i),c(centroid(i)))) for each i in [1,|x|]
Wherre:
x
is the list of samples|x|
is the size of x
(number of elements)[1,|x|]
all the numbers from 1 to |x|
(inclusive)c
is the list of centroids (or means) of clusters (i.e., for k
clusters |c| = k)distance(a,b)
(sometimes denoted as ||a-b|| is the distance between "point" a to "point" b (In euclidean 2D space it is sqrt((a.x-b.x)^2 + (a.y-b.y)^2)
)x(i)
Note that this approach does not require switching to supervised technique and can be fully automated!
Upvotes: 4
Reputation: 9413
As I understand it, you need some data with labels to cross-validate you clustering algorithm.
Upvotes: 1