Euclidean vs Cosine for text data

Question

If I use a tf-idf feature representation (or just document length normalization), are Euclidean distance and (1 - cosine similarity) basically the same? All the textbooks I have read, and other forums and discussions, say cosine similarity works better for text...

I wrote some basic code to test this and found that they are indeed comparable: the floating-point values are not exactly the same, but one looks like a scaled version of the other. Below are the results of both measures on simple demo text data. Text no. 2 is a long line of about 50 words; the rest are short lines of about 10 words.

Cosine similarity: 0.0, 0.2967, 0.203, 0.2058

Euclidean distance: 0.0, 0.285, 0.2407, 0.2421

Note: If this question is more suitable for Cross Validated or Data Science, please let me know.

Has QUIT--Anony-Mousse · Accepted Answer

If your data is normalized to unit length, then it is very easy to prove that

Euclidean(A,B)^2 = ||A||^2 + ||B||^2 - 2 A·B = 2 - 2 Cos(A,B)

This holds if ||A|| = ||B|| = 1. It does not hold in the general case, and it depends on the exact order in which you perform your normalization steps. For example, if you first normalize your document to unit length and then apply IDF weighting, it will not hold...

Unfortunately, people use all kinds of variants, including quite different versions of IDF normalization.
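The order dependence described above can be checked numerically. The following is a minimal sketch using NumPy, not a reference implementation: the term-count vectors `a` and `b`, the `idf` weights, and the helper names are all made up for illustration. It compares the squared Euclidean distance with 2 - 2 cos when unit-length normalization is the last step, and again when IDF weighting is applied after normalization.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit Euclidean length."""
    return v / np.linalg.norm(v)

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def sq_euclidean(a, b):
    """Squared Euclidean distance."""
    return float(np.sum((a - b) ** 2))

# Hypothetical term counts for two documents over a 4-term vocabulary.
a = np.array([3.0, 0.0, 1.0, 2.0])
b = np.array([1.0, 2.0, 0.0, 2.0])
# Hypothetical IDF weights, one per term.
idf = np.array([0.5, 1.0, 2.0, 3.0])

# Case 1: IDF weighting first, unit-length normalization last.
a1, b1 = unit(a * idf), unit(b * idf)
gap_normalized_last = abs(sq_euclidean(a1, b1) - (2 - 2 * cosine(a1, b1)))

# Case 2: unit-length normalization first, IDF weighting last.
# The weighted vectors no longer have length 1.
a2, b2 = unit(a) * idf, unit(b) * idf
gap_weighted_last = abs(sq_euclidean(a2, b2) - (2 - 2 * cosine(a2, b2)))

print("normalize last:", gap_normalized_last)
print("weight last:   ", gap_weighted_last)
```

In the first case the gap is zero up to floating-point rounding, so on unit-length vectors the two measures rank documents identically. In the second case the gap is clearly nonzero, because the vectors are no longer unit length when the distance is computed.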
