k-means clustering with new training data?

Question

I'm working on some image recognition stuff and am trying to use k-means as part of the matching algorithm.

I have lots of vectors (SURF descriptors, to be exact) in a database, and I would like to cluster them for future matching.

The problem is that I expect the training dataset to grow (new training data may arrive), which makes it impossible to train on all the data in one run.

It would be OK to cluster some data first, but does that mean every batch of new data needs a full re-clustering? If I'm confident enough in the existing clusters, does a small amount of extra data (e.g. 1% on top of the existing data) hurt the clustering?

Has QUIT--Anony-Mousse · Accepted Answer

K-means is not a particularly smart algorithm. And on SIFT vectors, the results are often not much better than random convex partitions anyway.

If your initial sample was representative, there should be no need to rerun the clustering: the new data should have little effect on the centroids anyway.
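A minimal sketch of why a small batch barely moves the centroids, assuming you kept the per-cluster point counts from the first run. It assigns each new vector to its nearest existing centroid and folds it in as a running mean (an online-style update, not a full k-means pass). The names `fold_in`, `counts` and the toy 2-D data are made up for illustration; real SURF descriptors have 64 or 128 dimensions.

```python
import math

def nearest(centroids, point):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(centroids[i], point))

def fold_in(centroids, counts, new_points):
    """Assign new points to the existing clusters and update centroids as running means.

    Returns the updated centroids, the updated counts, and how far each centroid moved.
    """
    updated = [list(c) for c in centroids]
    new_counts = list(counts)
    for p in new_points:
        i = nearest(centroids, p)  # assignment uses the original, frozen centroids
        new_counts[i] += 1
        # Incremental mean: c += (p - c) / n
        updated[i] = [c + (x - c) / new_counts[i] for c, x in zip(updated[i], p)]
    shifts = [math.dist(a, b) for a, b in zip(centroids, updated)]
    return updated, new_counts, shifts

# Toy example: two clusters of 100 points each, plus 1% new data.
centroids = [[0.0, 0.0], [10.0, 10.0]]
counts = [100, 100]
new_points = [[1.0, 0.0], [10.0, 11.0]]

updated, new_counts, shifts = fold_in(centroids, counts, new_points)
print(shifts)  # per-centroid displacement caused by the new batch
```

If the shifts stay small relative to the distances between centroids, the existing clustering is probably still fine. Large shifts suggest the new data is not drawn from the same distribution as the initial sample.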

To speed up any re-clustering, you can also reuse the previous centroids as initial seeds. It should then converge in far fewer iterations.
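As a rough illustration of seeding with the previous centroids, here is a plain Lloyd's k-means in pure Python that takes its starting centroids as an argument and reports how many iterations it ran. The function name `kmeans` and the toy 2-D data are hypothetical; with a library implementation you would pass the old centroids as the initial centers instead.

```python
import math

def kmeans(points, seeds, max_iter=100):
    """Plain Lloyd's k-means started from the given seeds.

    Returns the final centroids and the number of iterations run.
    """
    centroids = [list(s) for s in seeds]
    for iteration in range(1, max_iter + 1):
        # Assignment step: group points by their nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)), key=lambda j: math.dist(centroids[j], p))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        new_centroids = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else c
            for cl, c in zip(clusters, centroids)
        ]
        if new_centroids == centroids:  # converged: nothing moved
            return centroids, iteration
        centroids = new_centroids
    return centroids, max_iter

old_data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
            [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0]]
new_data = [[0.4, 0.6]]
all_data = old_data + new_data

# First run on the old data only, with naive seeds (the first two points).
prev_centroids, _ = kmeans(old_data, seeds=old_data[:2])

# Re-clustering everything: naive seeds versus the previous centroids as seeds.
cold_centroids, cold_iters = kmeans(all_data, seeds=all_data[:2])
warm_centroids, warm_iters = kmeans(all_data, seeds=prev_centroids)
print(cold_iters, warm_iters)
```

On this toy data the warm start needs fewer passes than the cold start and ends at the same centroids. How much you save on real descriptors depends on how far the new data pulls the clusters.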
