Cosine similarity alternative for tf-idf (triangle inequality)

I am trying to use tf-idf to cluster similar documents. One of the major drawbacks of my system is that it uses cosine similarity to decide which vectors should be grouped together.

The problem is that cosine similarity does not satisfy the triangle inequality. Because in my case the same vector cannot belong to multiple clusters, I have to merge every pair of clusters that share an element, which can cause two documents to be grouped together even if they are not similar to each other.

Is there another way of measuring the similarity of two documents so that:

Vectors score as very similar based on their direction, regardless of their magnitude
It satisfies the triangle inequality: if A is similar to B and B is similar to C, then A is also similar to C
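
To make the problem concrete, here is a minimal sketch showing cosine distance (taken here as 1 minus cosine similarity) breaking the triangle inequality. The three 2-D vectors and the helper name cosine_distance are made up for illustration; they are not from my actual system.

```python
import numpy as np

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Three hypothetical "documents" pointing at 0, 45 and 90 degrees.
a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
c = np.array([0.0, 1.0])

d_ab = cosine_distance(a, b)  # 1 - cos(45 degrees)
d_bc = cosine_distance(b, c)  # 1 - cos(45 degrees)
d_ac = cosine_distance(a, c)  # 1 - cos(90 degrees)

# The triangle inequality would require d_ac <= d_ab + d_bc.
violated = d_ac > d_ab + d_bc
print(d_ab, d_bc, d_ac, violated)
```

Here A and C are orthogonal, yet each is fairly close to B, so the direct distance from A to C is larger than the detour through B.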

Upvotes: 2

Answers (2)

Arash

Reputation: 1044

Not sure if this will help, but have a look at the TS-SS method in this paper. It addresses some drawbacks of cosine similarity and Euclidean distance (ED), which helps it identify similarity among vectors with higher accuracy. That higher accuracy helps you see which documents are highly similar and can be grouped together. The paper explains why TS-SS can help with that.

Upvotes: 2

Has QUIT--Anony-Mousse

Reputation: 77454

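Cosine distance is, up to a constant factor, squared Euclidean distance on normalized data: for unit-length vectors a and b, ||a - b||^2 = 2(1 - cos(a, b)).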

So simply L2-normalize your vectors to unit length and use Euclidean distance, which satisfies the triangle inequality.
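
A rough sketch of that idea with NumPy, in case it helps. The matrix X stands in for tf-idf rows and the helper l2_normalize is a name chosen for this example (both are assumptions, not part of the question); the code normalizes the rows, then checks the relation to cosine similarity and the triangle inequality numerically.

```python
import numpy as np

def l2_normalize(X):
    """Scale each row of X to unit Euclidean length (all-zero rows stay zero)."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # avoid division by zero for empty documents
    return X / norms

# Hypothetical tf-idf rows; row 3 has the same direction as row 0
# but twice the magnitude.
X = np.array([[3.0, 0.0, 0.0],
              [2.0, 2.0, 0.0],
              [0.0, 5.0, 0.0],
              [6.0, 0.0, 0.0]])

U = l2_normalize(X)

# On unit vectors the dot product is the cosine similarity.
cos_sim = U @ U.T

# Pairwise Euclidean distances between the normalized rows.
diff = U[:, None, :] - U[None, :, :]
euclid = np.linalg.norm(diff, axis=2)

# Squared Euclidean distance equals 2 * (1 - cosine similarity).
identity_holds = np.allclose(euclid ** 2, 2.0 * (1.0 - cos_sim))

# Check d(i, k) <= d(i, j) + d(j, k) for every triple (i, j, k).
triangle_holds = np.all(
    euclid[:, None, :] <= euclid[:, :, None] + euclid[None, :, :] + 1e-12
)

print(identity_holds, triangle_holds)
```

Because the rows are normalized first, magnitude no longer matters: rows 0 and 3 end up at distance 0. Keep in mind that the triangle inequality only bounds d(A, C) by d(A, B) + d(B, C); it does not make a fixed similarity threshold transitive, so chains of merged clusters can still drift apart.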

Upvotes: 0
