Reputation: 336
I have found that the Okapi similarity measure can be used to calculate document similarity, from here http://www2002.org/CDROM/refereed/643/node6.html and from this paper http://singhal.info/ieee2001.pdf
I want to calculate the similarity between documents in a document collection using the Okapi similarity scheme with Lucene.
E.g. I have 10 documents (doc #A, #B, #C, #D, etc.) in my document collection. I'll pick one document as the query document, say doc #A. Then for each term 1..n of the query document I'll calculate the
idfOfQueryTerm = log((totalNumIndexedDocs - docFreq + 0.5) / (docFreq + 0.5))
Then I'll take the sum of idfOfQueryTerm over terms 1 to n:
idfOfQueryDoc = sum of (idfOfQueryTerm)
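As a rough sketch of this step in Python (this is not Lucene code; the function and parameter names are made up for illustration, and the natural log is an assumption since the base isn't stated):

```python
import math


def okapi_idf(total_docs: int, doc_freq: int) -> float:
    """Okapi-style idf: log((N - df + 0.5) / (df + 0.5)).

    total_docs and doc_freq are hypothetical names for
    totalNumIndexedDocs and docFreq; natural log is assumed.
    """
    return math.log((total_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def idf_of_query_doc(query_terms, doc_freqs, total_docs):
    """Sum of idf over the distinct terms of the query document."""
    return sum(okapi_idf(total_docs, doc_freqs[t]) for t in set(query_terms))
```

With this form of idf, a term that occurs in more than half of the documents gets a negative value (e.g. docFreq = 8 out of 10), which matters for a collection as small as 10 documents.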
Then for each of the 10 documents (including the query doc), I'll calculate the total term frequency of the document with this equation, based on the query terms of the query document that was selected first:
tfOfDocument = (2.2 * termFrq) / (1.2 * (0.25 + 0.75 * docLength / this.avgDocLength) + termFrq)
So I'll end up with 10 tfOfDocument values, one for each document, and one idfOfQueryDoc value.
Then I can calculate the similarity between the query document and the other documents using one of these two methods:
1) Similarity between query doc and doc #B = idfOfQueryDoc * tfOfDocument #B
2) Similarity between query doc and doc #B = idfOfQueryDoc * tfOfDocument #B * tfOfDocument #queryDoc
I want to know whether my understanding of the Okapi similarity measure is correct.
Which of the two methods above is better for calculating document similarity?
Upvotes: 1
Views: 426
Reputation: 28762
Based on the first link, the similarity between the query document and another document is:
sim(query, doc) = sum(t in terms(query), freq(t, query) * w(t, doc))
where (from the second link, slightly modified as I think the formula in the link is incorrect)
w(t, doc) = idf(t) * (k+1)*freq(t, doc) / (k*(1-b + b*ls(doc)) + freq(t, doc))
ls(doc) = len(doc)/avgdoclen
and idf(t) is your idfOfQueryTerm, freq(t, doc) is the frequency of term t in document doc.
Choosing b = 0.75 and k = 1.2 (so that 1-b = 0.25) you get
w(t, doc) = idf(t) * 2.2*freq(t, doc) / (1.2*(0.25+0.75*ls(doc)) + freq(t, doc))
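A minimal Python sketch of that weight, using the constants as they appear in the formula above (k = 1.2 and 0.25 + 0.75*ls(doc) as the length normalization). The function and parameter names are hypothetical, not part of Lucene:

```python
def okapi_weight(idf: float, term_freq: float, doc_len: float,
                 avg_doc_len: float, k: float = 1.2, b: float = 0.75) -> float:
    """w(t, doc) = idf(t) * (k+1)*freq / (k*(1-b + b*ls(doc)) + freq)."""
    ls = doc_len / avg_doc_len  # ls(doc) = len(doc) / avgdoclen
    return idf * (k + 1) * term_freq / (k * (1 - b + b * ls) + term_freq)
```

For a document of exactly average length, ls(doc) = 1 and the denominator reduces to k + freq(t, doc), so a single occurrence of the term gives w(t, doc) = idf(t).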
Note: the two links give slightly different equations, although the difference is mostly in the weighting, not the fundamentals.
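Putting the pieces together, here is a small self-contained sketch of sim(query, doc) over a toy in-memory collection. It assumes whitespace-tokenized documents, natural log for idf, and k = 1.2 with the 0.25 + 0.75*ls(doc) normalization; the toy documents and all names are made up for illustration, and with Lucene you would read the term and document frequencies from the index instead.

```python
import math
from collections import Counter


def okapi_similarity(query_tokens, doc_tokens, doc_freqs, total_docs,
                     avg_doc_len, k=1.2, b=0.75):
    """sim(query, doc) = sum over query terms of freq(t, query) * w(t, doc)."""
    query_tf = Counter(query_tokens)
    doc_tf = Counter(doc_tokens)
    ls = len(doc_tokens) / avg_doc_len
    score = 0.0
    for term, q_freq in query_tf.items():
        f = doc_tf.get(term, 0)
        if f == 0:
            continue  # term absent from doc, so w(t, doc) = 0
        df = doc_freqs[term]
        idf = math.log((total_docs - df + 0.5) / (df + 0.5))
        w = idf * (k + 1) * f / (k * (1 - b + b * ls) + f)
        score += q_freq * w
    return score


# Toy collection (hypothetical data).
docs = {
    "A": "okapi ranking with lucene".split(),
    "B": "lucene index search".split(),
    "C": "cooking pasta at home".split(),
    "D": "okapi weighting scheme".split(),
    "E": "gardening tips".split(),
}

total_docs = len(docs)
avg_doc_len = sum(len(tokens) for tokens in docs.values()) / total_docs

# Document frequency: number of documents containing each term.
doc_freqs = Counter(term for tokens in docs.values() for term in set(tokens))

# Use doc A as the query document and score every document against it.
scores = {
    name: okapi_similarity(docs["A"], tokens, doc_freqs, total_docs, avg_doc_len)
    for name, tokens in docs.items()
}
```

Note that the measure is not symmetric in general: sim(A, B) uses the raw term frequencies of A and the length-normalized weights of B, so swapping the two documents can give a different score.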
Upvotes: 2