Reputation: 336
I’m indexing a collection of documents with Lucene, enabling TermVector at indexing time. I then read the terms and their frequencies from the index and compute a TF-IDF score vector for each document. Using these vectors, I calculate pairwise cosine similarity between documents with Wikipedia's cosine similarity equation.
Here is my problem: say this collection contains two identical documents, “A” and “B” (each with more than 200 sentences). Pairwise cosine similarity between A and B is 1, which is perfectly fine. But if I remove a single sentence from document B, the cosine similarity between the two drops to about 0.85. The documents are almost identical, but the cosine value is not. I suspect the problem is with the equation I’m using.
Is there a better way/equation for calculating cosine similarity between documents?
Edited
This is how I calculate cosine similarity; doc1[] and doc2[] are the TF-IDF vectors of the corresponding documents. The vectors contain only the scores, not the words:
private double cosineSimBetweenTwoDocs(float doc1[], float doc2[]) {
    int doc1Len = doc1.length;
    int doc2Len = doc2.length;
    float numerator = 0;
    float temSumDoc1 = 0;
    float temSumDoc2 = 0;
    double euclideanNormOfDoc1 = 0;
    double euclideanNormOfDoc2 = 0;
    if (doc1Len > doc2Len) {
        for (int i = 0; i < doc2Len; i++) {
            numerator += doc1[i] * doc2[i];
            temSumDoc1 += doc1[i] * doc1[i];
            temSumDoc2 += doc2[i] * doc2[i];
        }
        euclideanNormOfDoc1 = Math.sqrt(temSumDoc1);
        euclideanNormOfDoc2 = Math.sqrt(temSumDoc2);
    } else {
        for (int i = 0; i < doc1Len; i++) {
            numerator += doc1[i] * doc2[i];
            temSumDoc1 += doc1[i] * doc1[i];
            temSumDoc2 += doc2[i] * doc2[i];
        }
        euclideanNormOfDoc1 = Math.sqrt(temSumDoc1);
        euclideanNormOfDoc2 = Math.sqrt(temSumDoc2);
    }
    return numerator / (euclideanNormOfDoc1 * euclideanNormOfDoc2);
}
Upvotes: 3
Views: 6583
Reputation: 368
As I said in my comment, I think you made a mistake somewhere. The vectors should actually contain <word, frequency> pairs, not just the frequencies: the word is what keeps the dimensions of the two vectors aligned. When you delete a sentence, only the frequencies of the words in that sentence decrease; the entries for the remaining words are not shifted.
Consider the following example:
Document a:
A B C A A B C. D D E A B. D A B C B A.
Document b:
A B C A A B C. D A B C B A.
Vector a:
A:6, B:5, C:3, D:3, E:1
Vector b:
A:5, B:4, C:3, D:1, E:0
Which results in the following similarity measure:
(6*5 + 5*4 + 3*3 + 3*1 + 1*0) / (sqrt(6^2+5^2+3^2+3^2+1^2) * sqrt(5^2+4^2+3^2+1^2+0^2))
= 62 / (8.94427 * 7.14143)
= 0.970648
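The arithmetic above can be verified in a few lines; this is a quick standalone check I wrote, with the example's frequency vectors hard-coded in the order A, B, C, D, E:

```java
public class CosineCheck {
    public static void main(String[] args) {
        // Frequency vectors from the example above, aligned by word.
        double[] a = {6, 5, 3, 3, 1};
        double[] b = {5, 4, 3, 1, 0};

        double dot = 0, sumA = 0, sumB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];   // inner product: 62
            sumA += a[i] * a[i];  // squared norm of a: 80
            sumB += b[i] * b[i];  // squared norm of b: 51
        }
        double sim = dot / (Math.sqrt(sumA) * Math.sqrt(sumB));
        System.out.println(sim); // ~0.970648
    }
}
```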
Edit I don't think your source code works either. Consider the following code, which produces the correct result for the example above:
import java.util.HashMap;
import java.util.Map;

public class DocumentVector {
    // Sparse term-frequency vector: word -> count.
    Map<String, Integer> wordMap = new HashMap<String, Integer>();

    public void incCount(String word) {
        Integer oldCount = wordMap.get(word);
        wordMap.put(word, oldCount == null ? 1 : oldCount + 1);
    }

    double getCosineSimilarityWith(DocumentVector otherVector) {
        double innerProduct = 0;
        for (String w : this.wordMap.keySet()) {
            // Dimensions are matched by word, not by array position.
            innerProduct += this.getCount(w) * otherVector.getCount(w);
        }
        return innerProduct / (this.getNorm() * otherVector.getNorm());
    }

    double getNorm() {
        double sum = 0;
        for (Integer count : wordMap.values()) {
            sum += count * count;
        }
        return Math.sqrt(sum);
    }

    int getCount(String word) {
        return wordMap.containsKey(word) ? wordMap.get(word) : 0;
    }

    public static void main(String[] args) {
        String doc1 = "A B C A A B C. D D E A B. D A B C B A.";
        String doc2 = "A B C A A B C. D A B C B A.";

        DocumentVector v1 = new DocumentVector();
        for (String w : doc1.split("[^a-zA-Z]+")) {
            v1.incCount(w);
        }
        DocumentVector v2 = new DocumentVector();
        for (String w : doc2.split("[^a-zA-Z]+")) {
            v2.incCount(w);
        }
        System.out.println("Similarity = " + v1.getCosineSimilarityWith(v2));
    }
}
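Since the question uses TF-IDF scores rather than raw counts, the same map-based approach carries over by keying the weights by term. This is my own sketch, not code from the question, and the weight values in main are hypothetical placeholders:

```java
import java.util.HashMap;
import java.util.Map;

public class WeightedVector {
    // Sparse TF-IDF vector: term -> weight. Dimensions are matched by term.
    Map<String, Double> weights = new HashMap<String, Double>();

    double get(String term) {
        Double w = weights.get(term);
        return w == null ? 0.0 : w;
    }

    double norm() {
        double sum = 0;
        for (double w : weights.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    double cosine(WeightedVector other) {
        double dot = 0;
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            // Missing terms contribute 0, so dropping a sentence only
            // lowers a few weights; it never misaligns the vectors.
            dot += e.getValue() * other.get(e.getKey());
        }
        return dot / (norm() * other.norm());
    }

    public static void main(String[] args) {
        // Hypothetical TF-IDF weights for two near-identical documents.
        WeightedVector d1 = new WeightedVector();
        d1.weights.put("lucene", 0.8);
        d1.weights.put("index", 0.5);
        d1.weights.put("term", 0.3);

        WeightedVector d2 = new WeightedVector();
        d2.weights.put("lucene", 0.8);
        d2.weights.put("index", 0.4);
        d2.weights.put("term", 0.3);

        System.out.println("Similarity = " + d1.cosine(d2));
    }
}
```

With weights this close, the similarity stays near 1, which is the behavior the question expects for a document missing one sentence.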
Upvotes: 5