Reputation: 2825
I am trying to use tf-idf to cluster similar documents. One of the major drawback of my system is that it uses cosine similarity to decide which vectors should be group together.
The problem is that cosine similarity does not satisfy triangle inequality. Because in my case I cannot have the same vector in multiple clusters, I have to merge every cluster with an element in common, which can cause two documents to be grouped together even if they're not similar to each other.
Is there another way of measure the similarity of two documents so that:
Upvotes: 2
Views: 2991
Reputation: 1044
Not sure if it can help you. Have a look at TS-SS method in this paper. It covers some drawbacks from Cosine and ED which helps to identify similarity among vectors with higher accuracy. The higher accuracy helps you to understand which documents are highly similar and can be grouped together. The paper shows why TS-SS can help you with that.
Upvotes: 2
Reputation: 77454
Cosine is squared Euclidean on normalized data.
So simply L2 normalize your vectors to unit length, and use Euclidean.
Upvotes: 0