Michal

Reputation: 2039

Speeding up improved k-means clustering (Ward criterion)

I use k-means clustering with random initialization to identify clusters. The algorithm works well on clean data, but when the data contains a lot of noise, my k-means loses its robustness and gives a different solution on every run over the same data set.

So I decided to improve my k-means clustering so that it minimizes the Ward criterion:

  1. Do the standard k-means clustering
  2. Go through all points and all clusters and find a point P and clusters A and B such that moving P from cluster A to cluster B yields the smallest Ward criterion for the resulting clustering
  3. If such a point is found, move it from A to B, update the cluster centers, and go back to step 2

I wrote this algorithm in C++ here. However, the problem is that this approach is extremely slow; I am dealing with clusters of roughly 20,000 points each.
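Roughly, step 2 looks like this (a simplified 1D sketch of the idea, not my actual code; it evaluates the change in the within-cluster sum of squares incrementally, using the standard formula, rather than recomputing the whole criterion):

    #include <cstddef>
    #include <vector>

    // Simplified 1D sketch of step 2: find the single point move that most
    // decreases the total within-cluster sum of squares (Ward criterion).
    // Moving point p from cluster A (size nA, center cA) to B (size nB, center cB)
    // changes the criterion by
    //     delta = nB/(nB+1) * (p - cB)^2  -  nA/(nA-1) * (p - cA)^2,
    // so a negative delta is an improving move.
    struct Move { std::size_t point; int from; int to; double delta; };

    Move find_best_move(const std::vector<double>& points,
                        const std::vector<int>& assignment,       // cluster id per point
                        const std::vector<double>& centers,
                        const std::vector<std::size_t>& sizes)
    {
        Move best{0, -1, -1, 0.0};                                // delta >= 0 means "no improving move"
        for (std::size_t i = 0; i < points.size(); ++i) {
            const int a = assignment[i];
            if (sizes[a] <= 1) continue;                          // never empty a cluster
            const double nA = static_cast<double>(sizes[a]);
            const double dA = points[i] - centers[a];
            const double removeGain = nA / (nA - 1.0) * dA * dA;  // SSE drop when leaving A
            for (int b = 0; b < static_cast<int>(centers.size()); ++b) {
                if (b == a) continue;
                const double nB = static_cast<double>(sizes[b]);
                const double dB = points[i] - centers[b];
                const double addCost = nB / (nB + 1.0) * dB * dB; // SSE rise when joining B
                const double delta = addCost - removeGain;
                if (delta < best.delta) best = {i, a, b, delta};
            }
        }
        return best;   // step 3: apply the move, update centers and sizes, repeat
    }

Even in this form, each pass over the data is O(n·k) and many passes may be needed before no improving move remains, which is why it becomes very slow at around 20,000 points per cluster.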

Can you suggest a better solution, or help me speed this algorithm up?

Upvotes: 0

Views: 895

Answers (1)

Michal

Reputation: 2039

I finally found the solution. I've realized that:

  • My Ward-based approach was really useless.
  • PCA was unusable for me because I am working only with 1D data.
  • After I implemented k-means++, as Micka suggested, the reliability of k-means improved (a rough sketch of the seeding is below this list). Nevertheless, it still occasionally gave bad solutions (experimentally, about 1 run in 5 on the same data was bad).
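For reference, k-means++ seeding on 1D data looks roughly like this (an illustrative sketch, not the exact implementation I used):

    #include <algorithm>
    #include <cstddef>
    #include <limits>
    #include <random>
    #include <vector>

    // Illustrative sketch of k-means++ seeding for 1D data: each new center is
    // drawn with probability proportional to the squared distance from the
    // nearest already-chosen center.
    std::vector<double> kmeanspp_init(const std::vector<double>& points, int k, std::mt19937& rng)
    {
        std::uniform_int_distribution<std::size_t> uniform(0, points.size() - 1);
        std::vector<double> centers{points[uniform(rng)]};      // first center: uniform pick

        std::vector<double> d2(points.size());
        while (static_cast<int>(centers.size()) < k) {
            for (std::size_t i = 0; i < points.size(); ++i) {   // squared distance to nearest center
                double best = std::numeric_limits<double>::max();
                for (double c : centers) {
                    const double d = points[i] - c;
                    best = std::min(best, d * d);
                }
                d2[i] = best;
            }
            std::discrete_distribution<std::size_t> weighted(d2.begin(), d2.end());
            centers.push_back(points[weighted(rng)]);           // D^2-weighted pick
        }
        return centers;
    }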

What definitely helped me was averaging the initialization: I ran k-means 5 times, calculated the mean of the cluster centers across those runs, and finally ran k-means once more with the calculated means as the initial solution.
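In code, the procedure looks roughly like this (a sketch, not my exact code; run_kmeans stands for whatever k-means routine you already have, returning the k final centers for 1D data, and sorting the centers to match them up between runs is an extra step I add here for illustration):

    #include <algorithm>
    #include <functional>
    #include <vector>

    // Sketch of the averaging trick: run k-means several times, average the
    // (sorted) centers across runs, and use the averages as the initial centers
    // of one final run. Sorting aligns centers between runs, which works for
    // 1D data where the cluster order is unambiguous.
    std::vector<double> averaged_init(
        const std::vector<double>& points, int k, int runs,
        const std::function<std::vector<double>(const std::vector<double>&, int)>& run_kmeans)
    {
        std::vector<double> mean(k, 0.0);
        for (int r = 0; r < runs; ++r) {
            std::vector<double> centers = run_kmeans(points, k);
            std::sort(centers.begin(), centers.end());          // align centers across runs
            for (int j = 0; j < k; ++j)
                mean[j] += centers[j] / runs;
        }
        return mean;   // feed these as the initial centers of the final k-means run
    }

With runs = 5 this matches the procedure described above; the final k-means run then starts from a much more stable initialization.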

Upvotes: 2
