Reputation: 1191
Hi, I have clustered some data with the kmeans function and stored the cluster centers that it produces as output. Now I have a new set of vectors in a Mat object and want to know which cluster each vector belongs to. Is there a simple way to do that, or should I just calculate the Euclidean distance from each vector to all the centers and choose the cluster it is closest to?
If I should go for the second way, are there any efficiency considerations to make it fast?
Upvotes: 0
Views: 493
Reputation: 8022
It seems that you're interested in performing some type of cluster assignment using the results of running K-Means on an initial data set, right?
You could just assign each new observation to the closest mean. Unfortunately, with K-Means you don't know anything about the shape or size of each cluster. For example, consider a scenario where a new vector is equidistant (or roughly equidistant) from two means. What do you do in this scenario? Do you make a hard assignment to one of the clusters?
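To make the closest-mean idea concrete, here is a minimal sketch in plain Python (not the OpenCV API; the function names and sample centers are made up for illustration). It compares squared distances, since the square root is not needed to find the minimum, and also reports how close the runner-up center is so that near-ties can be spotted. It assumes there are at least two centers.

```python
def squared_distance(a, b):
    """Squared Euclidean distance; the square root is not needed for comparisons."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_to_nearest(vector, centers):
    """Return (index of the closest center, ambiguity ratio).

    The ratio is closest / second-closest squared distance.
    A value near 1.0 means the vector is almost equidistant from two centers.
    """
    dists = sorted((squared_distance(vector, c), i) for i, c in enumerate(centers))
    (best_d, best_i), (second_d, _) = dists[0], dists[1]
    ratio = best_d / second_d if second_d > 0 else 1.0
    return best_i, ratio


# Hypothetical cluster centers, standing in for the kmeans output
centers = [(0.0, 0.0), (10.0, 0.0)]

print(assign_to_nearest((1.0, 0.0), centers))  # clearly closest to center 0
print(assign_to_nearest((5.1, 0.0), centers))  # closest to center 1, but ratio is near 1
```

A ratio close to 1 is a hint that the hard assignment is not very trustworthy for that vector.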
In this situation it's probably better to actually look at the original data that comprises each of the clusters, and do some type of K-Nearest Neighbor assignment (http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm). For example, it may turn out that while the new vector is roughly equidistant from two different cluster centers, it is much closer to the data from one of the clusters (indicating that it likely belongs to that cluster).
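A rough sketch of that K-Nearest Neighbor assignment, again in plain Python rather than OpenCV. The function name `knn_assign`, the choice of `k=3`, and the sample points and labels are all assumptions for illustration; in practice the labels would be the ones kmeans returned for the original data.

```python
from collections import Counter


def knn_assign(vector, points, labels, k=3):
    """Assign `vector` to the most common cluster label among its k nearest original points."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Indices of the k original points closest to the new vector
    nearest = sorted(range(len(points)), key=lambda i: sqdist(vector, points[i]))[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical original data: cluster 0 is spread out, cluster 1 is tight
points = [(0.0, 0.0), (4.0, 0.0), (-4.0, 0.0), (4.5, 0.5),
          (10.0, 0.0), (10.2, 0.1), (9.8, -0.1)]
labels = [0, 0, 0, 0, 1, 1, 1]

# (5.6, 0.0) is roughly equidistant from the two cluster means,
# yet two of its three nearest neighbors belong to cluster 0
print(knn_assign((5.6, 0.0), points, labels, k=3))
```

This brute-force version scans every original point per query; for large data sets a spatial index (a k-d tree, for instance) would be the usual way to speed it up.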
As an alternative to K-Means, if you used something like a Mixture of Gaussians fitted with EM, you'd have not only a set of cluster centers (as you do with K-Means), but also a variance for each cluster describing its size. For each new observation, you could then compute the probability that it belongs to each cluster without revisiting the data from each cluster (as it's baked into the MoG EM model).
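As a sketch of what that probability computation could look like once a mixture has been fitted, the snippet below evaluates the posterior probability of each component for a new vector. It assumes diagonal covariances (one variance per dimension) and uses made-up weights, means, and variances; it does not do the EM fitting itself, and the function name is hypothetical.

```python
import math


def responsibilities(x, weights, means, variances):
    """Posterior probability of each mixture component for x (diagonal covariances assumed)."""
    log_ps = []
    for w, mu, var in zip(weights, means, variances):
        # log(weight) + log of the Gaussian density, summed over dimensions
        log_p = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            log_p += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
        log_ps.append(log_p)
    # Normalize in log space for numerical stability
    m = max(log_ps)
    exps = [math.exp(lp - m) for lp in log_ps]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical fitted model: component 0 is wide, component 1 is narrow
weights = [0.5, 0.5]
means = [(0.0, 0.0), (10.0, 0.0)]
variances = [(9.0, 9.0), (1.0, 1.0)]

# A point exactly halfway between the two means
probs = responsibilities((5.0, 0.0), weights, means, variances)
print(probs)
```

With these numbers the midpoint is assigned almost entirely to the wide component, which is exactly the information a nearest-mean rule cannot provide.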
Upvotes: 1