Reputation: 27
Suppose we are trying to measure similarity between two very similar documents.
Document A: "a b c d"
Document B: "a b c e"
This corresponds to a term-frequency matrix
     a  b  c  d  e
A    1  1  1  1  0
B    1  1  1  0  1
where the cosine similarity on the raw vectors is the dot product of the two vectors A and B, divided by the product of their magnitudes:
3/4 = (1*1 + 1*1 + 1*1 + 1*0 + 0*1) / (sqrt(4) * sqrt(4)).
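As a quick check of the arithmetic above, here is a minimal sketch in plain Python. The helper name cosine_similarity and the variable names are my own, not from any library.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        # Convention chosen here: an all-zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)

# Raw term-frequency rows for the terms a, b, c, d, e.
doc_a = [1, 1, 1, 1, 0]
doc_b = [1, 1, 1, 0, 1]

raw_similarity = cosine_similarity(doc_a, doc_b)
print(raw_similarity)  # 0.75
```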
But when we apply an inverse document frequency transformation by multiplying each term in the matrix by log(N / df_i), where N is the number of documents in the matrix (here 2) and df_i is the number of documents in which term i is present, we get a tf-idf matrix of
     a  b  c  d       e
A    0  0  0  log(2)  0
B    0  0  0  0       log(2)
Since "a" appears in both documents, it has an inverse-document-frequency value of 0. This is the same for "b" and "c". Meanwhile, "d" is in document A, but not in document B, so it is multiplied by log(2/1). "e" is in document B, but not in document A, so it is also multiplied by log(2/1).
The cosine similarity between these two vectors is 0, suggesting the two are totally different documents. Obviously, this is incorrect. For these two documents to be considered similar to each other using tf-idf weightings, we would need a third document C in the matrix which is vastly different from documents A and B.
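To make the effect concrete, here is a rough sketch that applies the log(N / df_i) weighting to the two-document matrix, and then to a three-document matrix. The third document C ("w x y z") is made up purely for illustration, and the helper names are my own.

```python
import math

def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def tfidf(tf_rows):
    """Multiply each term count by log(N / df) for that term's column."""
    n_docs = len(tf_rows)
    n_terms = len(tf_rows[0])
    idf = []
    for j in range(n_terms):
        df = sum(1 for row in tf_rows if row[j] > 0)
        idf.append(math.log(n_docs / df) if df else 0.0)
    return [[tf * w for tf, w in zip(row, idf)] for row in tf_rows]

# Columns: a, b, c, d, e
two_docs = [
    [1, 1, 1, 1, 0],  # A
    [1, 1, 1, 0, 1],  # B
]
a2, b2 = tfidf(two_docs)
sim_two = cosine_similarity(a2, b2)

# Columns: a, b, c, d, e, w, x, y, z (C is a hypothetical unrelated document)
three_docs = [
    [1, 1, 1, 1, 0, 0, 0, 0, 0],  # A
    [1, 1, 1, 0, 1, 0, 0, 0, 0],  # B
    [0, 0, 0, 0, 0, 1, 1, 1, 1],  # C
]
a3, b3, _ = tfidf(three_docs)
sim_three = cosine_similarity(a3, b3)

print(sim_two)    # 0.0, the shared terms a, b, c are all weighted to zero
print(sim_three)  # greater than 0, since a, b, c now carry weight log(3/2)
```

With C in the matrix, the similarity between A and B comes out at roughly 0.29 rather than 0, although that is still well below the 0.75 obtained from the raw counts.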
Thus, I am wondering whether and/or why we would use tf-idf weightings in combination with a cosine similarity metric to compare highly similar documents. None of the tutorials or StackOverflow questions I've read have been able to answer this question.
This post discusses similar failings with tf-idf weights using cosine similarities, but offers no guidance on what to do about them.
EDIT: as it turns out, the guidance I was looking for was in the comments of that blog post. It recommends using the formula
1 + log ( N / ni + 1)
as the inverse document frequency transformation instead. This would keep the weights of terms which are in every document close to their original weights, while inflating the weights of terms which are not present in a lot of documents by a greater degree. Interesting that this formula is not more prominently found in posts about tf-idf.
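The formula as written can be parenthesized in two ways, and I am not sure which one the comment intended, so the sketch below tries both readings on the two-document example. The names idf_reading_1 and idf_reading_2 are my own labels, not standard terminology.

```python
import math

def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def apply_weights(tf_row, idf):
    return [tf * w for tf, w in zip(tf_row, idf)]

n_docs = 2
doc_freq = [2, 2, 2, 1, 1]  # document frequencies of a, b, c, d, e
doc_a = [1, 1, 1, 1, 0]
doc_b = [1, 1, 1, 0, 1]

# Reading 1 (assumption): 1 + log(N / (ni + 1))
idf_reading_1 = [1 + math.log(n_docs / (ni + 1)) for ni in doc_freq]
# Reading 2 (assumption): 1 + log(N / ni + 1)
idf_reading_2 = [1 + math.log(n_docs / ni + 1) for ni in doc_freq]

sim_reading_1 = cosine_similarity(
    apply_weights(doc_a, idf_reading_1), apply_weights(doc_b, idf_reading_1)
)
sim_reading_2 = cosine_similarity(
    apply_weights(doc_a, idf_reading_2), apply_weights(doc_b, idf_reading_2)
)

print(sim_reading_1, sim_reading_2)
```

Under either reading the terms shared by both documents keep a nonzero weight, so the similarity between A and B no longer collapses to 0 (roughly 0.51 and 0.66 respectively).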
Upvotes: 0
Views: 2519
Reputation: 3740
Since "a" appears in both documents, it has an inverse-document-frequency value of 0
This is where you have made an error in using inverse document frequency (idf). Idf is meant to be computed over a large collection of documents (not just across two documents), the purpose being to be able to predict the importance of term overlaps in document pairs.
You would expect common terms such as 'the' and 'a' to overlap across all document pairs. Should that overlap contribute anything to your similarity score? No.
That is precisely the reason why the vector components are multiplied by the idf factor - just to dampen or boost a particular term overlap (a component of the form a_i*b_i being added to the numerator in the cosine-sim sum).
Now suppose you have a collection of computer science journals. Should an overlap of terms such as 'computer' and 'science' across a document pair be considered important? No. And idf takes care of this, because the idf of these terms will be considerably low in this collection.
What do you think will happen if you extend the collection to scientific articles of any discipline? In that collection, the idf value of the word 'computer' will no longer be low. And that makes sense because in this general collection, you would like to think that two documents are similar enough if they are on the same topic - computer science.
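Here is a toy illustration of how the idf of 'computer' depends on the collection. Both collections below are invented for the example, and the idf helper is just the plain log(N / df) form used in the question.

```python
import math

def idf(term, collection):
    """Plain log(N / df) idf of a term over a list of whitespace-tokenized documents."""
    n_docs = len(collection)
    df = sum(1 for doc in collection if term in doc.split())
    if df == 0:
        raise ValueError(f"term {term!r} does not occur in the collection")
    return math.log(n_docs / df)

# Hypothetical collection of computer science articles only.
cs_collection = [
    "computer science compiler design",
    "computer science network protocols",
    "computer vision science survey",
    "computer science database indexing",
]

# Hypothetical collection mixing two of those with other disciplines.
general_collection = cs_collection[:2] + [
    "marine biology coral reefs",
    "medieval history trade routes",
    "organic chemistry reaction kinetics",
    "quantum physics lattice models",
]

idf_cs = idf("computer", cs_collection)
idf_general = idf("computer", general_collection)

print(idf_cs)       # 0.0, 'computer' is in every document
print(idf_general)  # log(6 / 2), 'computer' now discriminates between topics
```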
Upvotes: 2