Reputation: 4813
I've done Kmeans clustering in OpenCV using C++ and have 12 cluster centers (each in 200 dimensions).
Now, I have a set of points in 200 dimensions and I'm trying to find the closest cluster (Vector Quantization).
Which distance is preferred over the other (Mahalanobis distance or Euclidean distance)? Currently I'm using Euclidean distance.
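For reference, the Euclidean nearest-center lookup I'm doing can be sketched like this (plain C++ rather than OpenCV types, with a hypothetical `nearestCenter` helper; the squared distance is enough since the square root doesn't change the argmin):

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Return the index of the cluster center nearest to `point`
// under squared Euclidean distance. Skipping the square root
// is safe because sqrt is monotonic, so the argmin is the same.
std::size_t nearestCenter(const std::vector<double>& point,
                          const std::vector<std::vector<double>>& centers) {
    std::size_t best = 0;
    double bestDist = std::numeric_limits<double>::max();
    for (std::size_t i = 0; i < centers.size(); ++i) {
        double d = 0.0;
        for (std::size_t j = 0; j < point.size(); ++j) {
            double diff = point[j] - centers[i][j];
            d += diff * diff;  // accumulate squared differences per dimension
        }
        if (d < bestDist) { bestDist = d; best = i; }
    }
    return best;
}
```

In my case `point` has 200 dimensions and `centers` holds the 12 k-means centroids.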
Upvotes: 3
Views: 1692
Reputation: 7138
Andrey's point is a valid one. I can add a general statement:
For Mahalanobis distance you need to be able to properly estimate the covariance matrix for each cluster. With 200 dimensions, the only way you can expect a reasonable estimate of a cluster's covariance matrix is with something on the order of several hundred to a few thousand datapoints per cluster. Multiply that by the 12 clusters you have and you easily need tens of thousands of datapoints to use Mahalanobis distance reasonably.
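To make the definition concrete, here is a minimal 2-D sketch of the Mahalanobis distance, sqrt((x-mu)^T S^{-1} (x-mu)), with the 2x2 covariance inverted in closed form (the function name and 2-D restriction are for illustration only; in 200 dimensions you would need a proper matrix library and, as argued above, far more data to estimate S):

```cpp
#include <array>
#include <cmath>
#include <stdexcept>

// Mahalanobis distance in 2D: sqrt((x - mu)^T * S^{-1} * (x - mu)),
// where S is the 2x2 covariance matrix of the cluster.
double mahalanobis2d(const std::array<double, 2>& x,
                     const std::array<double, 2>& mu,
                     const std::array<std::array<double, 2>, 2>& S) {
    double det = S[0][0] * S[1][1] - S[0][1] * S[1][0];
    if (std::fabs(det) < 1e-12)
        throw std::runtime_error("covariance matrix is singular");
    // Closed-form inverse of a 2x2 matrix: (1/det) * [ d -b; -c a ]
    double inv00 =  S[1][1] / det, inv01 = -S[0][1] / det;
    double inv10 = -S[1][0] / det, inv11 =  S[0][0] / det;
    double d0 = x[0] - mu[0], d1 = x[1] - mu[1];
    // Quadratic form (x - mu)^T S^{-1} (x - mu)
    double q = d0 * (inv00 * d0 + inv01 * d1)
             + d1 * (inv10 * d0 + inv11 * d1);
    return std::sqrt(q);
}
```

Note how a large variance along one axis shrinks that axis's contribution to the distance, which is exactly why a badly estimated covariance matrix distorts the metric.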
Apart from that: see how Euclidean distance works for you. If the results are reasonable, just stick with that; otherwise try Mahalanobis.
Finally, you might find more knowledgeable people on this subject on the stats stackexchange.
Upvotes: 4
Reputation: 20915
That is impossible to answer without knowing the context. There is no such thing as a good or bad metric; each one is better suited to a specific class of problems.
Upvotes: 4