Edgar Derby

Reputation: 2825

Cosine similarity alternative for tf-idf (triangle inequality)

I am trying to use tf-idf to cluster similar documents. One of the major drawbacks of my system is that it uses cosine similarity to decide which vectors should be grouped together.

The problem is that cosine similarity does not satisfy the triangle inequality. Because in my case the same vector cannot belong to multiple clusters, I have to merge all clusters that share an element, which can cause two documents to be grouped together even if they are not similar to each other.
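To make the issue concrete, here is a minimal sketch using cosine distance, taken here as 1 minus cosine similarity. The three 2-D vectors are made up for illustration and are not real tf-idf output. The middle vector is close to both of the others, yet the two outer vectors are far apart, so the direct distance exceeds the sum of the two hops.

```python
import math

def cosine_similarity(a, b):
    # Assumes both vectors are non-zero.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    return 1.0 - cosine_similarity(a, b)

# Hypothetical toy vectors: b lies at 45 degrees between a and c.
a = [1.0, 0.0]
b = [1.0, 1.0]
c = [0.0, 1.0]

d_ab = cosine_distance(a, b)
d_bc = cosine_distance(b, c)
d_ac = cosine_distance(a, c)

# The triangle inequality would require d_ac <= d_ab + d_bc.
violates_triangle = d_ac > d_ab + d_bc
print(d_ab, d_bc, d_ac, violates_triangle)
```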

Is there another way of measuring the similarity of two documents so that:

Upvotes: 2

Views: 2991

Answers (2)

Arash

Reputation: 1044

I'm not sure whether this will help, but have a look at the TS-SS method in this paper. It addresses some drawbacks of cosine similarity and Euclidean distance (ED), which lets it identify similarity among vectors more accurately. That, in turn, makes it easier to see which documents are highly similar and can be grouped together. The paper explains why TS-SS helps with this.


Upvotes: 2

Has QUIT--Anony-Mousse

Reputation: 77454

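On normalized data, cosine distance is equivalent to squared Euclidean distance up to a constant factor: for unit vectors a and b, ||a - b||^2 = 2 (1 - cos(a, b)).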

So simply L2-normalize your vectors to unit length and use Euclidean distance, which does satisfy the triangle inequality.
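A rough sketch of that suggestion, using only the standard library. The three document vectors are invented for illustration, and the helper names are not from any particular library. It checks that squared Euclidean distance on the normalized vectors equals 2 (1 - cosine similarity), and that plain Euclidean distance on them obeys the triangle inequality.

```python
import math

def l2_normalize(v):
    # Scale v to unit length; assumes v is not the zero vector.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical tf-idf-like vectors (made-up numbers).
docs = [
    [3.0, 0.0, 1.0],
    [1.0, 2.0, 0.0],
    [0.0, 1.0, 4.0],
]
unit_a, unit_b, unit_c = (l2_normalize(v) for v in docs)

# Squared Euclidean distance on unit vectors vs. 2 * (1 - cosine similarity).
squared_euclidean = euclidean(unit_a, unit_b) ** 2
from_cosine = 2.0 * (1.0 - cosine_similarity(docs[0], docs[1]))

# Euclidean distance is a metric, so the direct path is never longer than a detour.
triangle_holds = euclidean(unit_a, unit_c) <= (
    euclidean(unit_a, unit_b) + euclidean(unit_b, unit_c)
)
print(squared_euclidean, from_cosine, triangle_holds)
```

Because the Euclidean distance on unit vectors is a monotonic function of cosine similarity, nearest-neighbor rankings stay the same; only the distance values change.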

Upvotes: 0
