Reputation: 43
I use StandardAnalyzer to index my text. At query time, I run term queries and phrase queries. I believe Lucene has no trouble computing the term frequency and the phrase frequency for these, which is all that models like Dirichlet Similarity need. The BM25Similarity and TFIDFSimilarity models, however, also need IDF(term) and IDF(phrase). How does Lucene handle this?
Upvotes: 0
Views: 817
Reputation: 33351
With TFIDFSimilarity, the IDF of a phrase is calculated as the sum of the IDFs of its constituent terms. That is: idf("ab cd") = idf(ab) + idf(cd)
That value is then multiplied by the tf factor computed from the phrase frequency, and from there the phrase is treated very much like a single term for the purposes of scoring.
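As a rough sketch of that summation (not Lucene code), the snippet below assumes the classic TF-IDF formula idf = 1 + ln(maxDocs / (docFreq + 1)), which is consistent with the idf values in the explain output further down. The function names `idf` and `phrase_idf` are made up for illustration.

```python
import math

def idf(doc_freq: int, max_docs: int) -> float:
    """Assumed classic TF-IDF idf: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def phrase_idf(doc_freqs, max_docs: int) -> float:
    """Phrase idf as the sum of the idfs of the phrase's constituent terms."""
    return sum(idf(df, max_docs) for df in doc_freqs)

# docFreq values of the two terms in "text ab" (4 and 2), with maxDocs=4,
# taken from the explain output in this answer.
text_ab_idf = phrase_idf([4, 2], max_docs=4)
print(text_ab_idf)  # should be close to the 2.0645385 reported by explain
```

Lucene works in 32-bit floats, so the last digits may differ slightly from what this double-precision sketch prints.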
To see the whole story, I think it makes the most sense to look at an example. IndexSearcher.explain is a very useful tool for understanding scoring:
The Index:
The Query: "text ab" unique
Explain output of the first (top scoring) hit (doc 0):
1.3350155 = (MATCH) sum of:
  0.7981777 = (MATCH) weight(content:"text ab" in 0) [DefaultSimilarity], result of:
    0.7981777 = score(doc=0,freq=1.0 = phraseFreq=1.0), product of:
      0.7732263 = queryWeight, product of:
        2.0645385 = idf(), sum of:
          0.7768564 = idf(docFreq=4, maxDocs=4)
          1.287682 = idf(docFreq=2, maxDocs=4)
        0.37452745 = queryNorm
      1.0322692 = fieldWeight in 0, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = phraseFreq=1.0
        2.0645385 = idf(), sum of:
          0.7768564 = idf(docFreq=4, maxDocs=4)
          1.287682 = idf(docFreq=2, maxDocs=4)
        0.5 = fieldNorm(doc=0)
  0.5368378 = (MATCH) weight(content:unique in 0) [DefaultSimilarity], result of:
    0.5368378 = score(doc=0,freq=1.0 = termFreq=1.0), product of:
      0.6341301 = queryWeight, product of:
        1.6931472 = idf(docFreq=1, maxDocs=4)
        0.37452745 = queryNorm
      0.8465736 = fieldWeight in 0, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        1.6931472 = idf(docFreq=1, maxDocs=4)
        0.5 = fieldNorm(doc=0)
Note that the first half, which scores the "text ab" portion of the query, uses very much the same algorithm as the second half (scoring unique), except for the added summation in the phrase idf calculation.
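To make the parallel concrete, here is a hedged sketch that recomputes the doc 0 score from the docFreq, maxDocs and fieldNorm values in the output above. It assumes the classic formulas idf = 1 + ln(maxDocs / (docFreq + 1)), tf = sqrt(freq) and queryNorm = 1 / sqrt(sum of squared clause idfs) with all boosts equal to 1. The helper `clause_score` is a hypothetical name, not a Lucene API.

```python
import math

def idf(doc_freq: int, max_docs: int) -> float:
    # Assumed classic idf formula; matches the idf lines in the explain output.
    return 1.0 + math.log(max_docs / (doc_freq + 1))

MAX_DOCS = 4
phrase_idf = idf(4, MAX_DOCS) + idf(2, MAX_DOCS)  # "text ab": sum of term idfs
term_idf = idf(1, MAX_DOCS)                       # unique: a single idf

# Assumed queryNorm: 1 / sqrt(sum of squared clause weights), boosts of 1.
query_norm = 1.0 / math.sqrt(phrase_idf ** 2 + term_idf ** 2)

def clause_score(clause_idf: float, freq: float, field_norm: float) -> float:
    """Same algorithm for a phrase and a term; only the idf input differs."""
    query_weight = clause_idf * query_norm
    field_weight = math.sqrt(freq) * clause_idf * field_norm  # tf * idf * fieldNorm
    return query_weight * field_weight

phrase_part = clause_score(phrase_idf, freq=1.0, field_norm=0.5)
term_part = clause_score(term_idf, freq=1.0, field_norm=0.5)
doc0_score = phrase_part + term_part
print(query_norm, phrase_part, term_part, doc0_score)
```

The only place the phrase differs from the term is the value passed in as `clause_idf`; everything after that is shared.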
Explain output of the second hit (for good measure) (doc 2):
0.49384725 = (MATCH) product of:
  0.9876945 = (MATCH) sum of:
    0.9876945 = (MATCH) weight(content:"text ab" in 2) [DefaultSimilarity], result of:
      0.9876945 = score(doc=2,freq=2.0 = phraseFreq=2.0), product of:
        0.7732263 = queryWeight, product of:
          2.0645385 = idf(), sum of:
            0.7768564 = idf(docFreq=4, maxDocs=4)
            1.287682 = idf(docFreq=2, maxDocs=4)
          0.37452745 = queryNorm
        1.277368 = fieldWeight in 2, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = phraseFreq=2.0
          2.0645385 = idf(), sum of:
            0.7768564 = idf(docFreq=4, maxDocs=4)
            1.287682 = idf(docFreq=2, maxDocs=4)
          0.4375 = fieldNorm(doc=2)
  0.5 = coord(1/2)
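The second hit adds two things the first did not show: a phrase frequency above 1 and a coord factor, because only one of the two query clauses matched. The sketch below takes queryWeight, the phrase idf and fieldNorm directly from the output above as constants, and assumes tf = sqrt(freq) and coord = matching clauses / total clauses. The function names are illustrative, not Lucene's.

```python
import math

# Values copied from the explain output for doc 2.
QUERY_WEIGHT = 0.7732263  # queryWeight of the "text ab" clause
PHRASE_IDF = 2.0645385    # idf(), sum of the two term idfs
FIELD_NORM = 0.4375       # fieldNorm(doc=2)

def tf(freq: float) -> float:
    """Assumed classic tf: square root of the (phrase) frequency."""
    return math.sqrt(freq)

def coord(overlap: int, max_overlap: int) -> float:
    """Assumed coord: fraction of query clauses that matched the document."""
    return overlap / max_overlap

field_weight = tf(2.0) * PHRASE_IDF * FIELD_NORM  # phraseFreq=2.0
phrase_score = QUERY_WEIGHT * field_weight
doc2_score = phrase_score * coord(1, 2)           # only the phrase clause matched
print(field_weight, phrase_score, doc2_score)
```

So although the phrase clause alone scores higher in doc 2 than in doc 0, the missing unique clause and the coord(1/2) factor leave doc 2 with the lower total.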
Upvotes: 2