Lucene scoring for Phrase Query

Question

I index my text with StandardAnalyzer. At query time, I run both term queries and phrase queries. I believe Lucene has no trouble computing the term frequency and the phrase frequency, and that is all that models like Dirichlet Similarity need. BM25Similarity and TFIDFSimilarity, however, also need IDF(term) and IDF(phrase). How does Lucene handle this?

femtoRgon · Accepted Answer

In TFIDFSimilarity, the IDF of a phrase is calculated as the sum of the IDFs of its constituent terms. That is: idf("ab cd") = idf(ab) + idf(cd)

That value is then multiplied by the tf of the phrase frequency (tf(phraseFreq), just as for a term frequency), and the phrase is otherwise treated very much like a single term for the purposes of scoring.
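The summation can be sketched in a few lines of plain Java. This is not Lucene code: the class and method names are hypothetical, and the idf formula, 1 + ln(maxDocs / (docFreq + 1)), is assumed here because it reproduces the DefaultSimilarity numbers in the explain output in this answer. The document frequencies (4 for "text", 2 for "ab", out of 4 documents) are taken from that same example.

```java
// Hypothetical sketch, not part of the Lucene API.
class PhraseIdfSketch {
    // Assumed classic TF-IDF idf: 1 + ln(maxDocs / (docFreq + 1)).
    static double idf(int docFreq, int maxDocs) {
        return 1.0 + Math.log((double) maxDocs / (docFreq + 1));
    }

    // Phrase idf = sum of the idfs of the phrase's terms.
    static double phraseIdf(int[] docFreqs, int maxDocs) {
        double sum = 0.0;
        for (int docFreq : docFreqs) {
            sum += idf(docFreq, maxDocs);
        }
        return sum;
    }

    public static void main(String[] args) {
        // "text" occurs in 4 of 4 docs, "ab" in 2 of 4 docs.
        double value = phraseIdf(new int[] {4, 2}, 4);
        // Should come out close to the 2.0645385 shown as "idf(), sum of:".
        System.out.println("idf(\"text ab\") ~ " + value);
    }
}
```

Lucene itself works in float, so the last digits may differ slightly from what this double-precision sketch prints.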

To see the whole story, I think it makes the most sense to look at an example. IndexSearcher.explain is a very useful tool for understanding scoring:

The Index:

doc 0: text ab unique
doc 1: text
doc 2: text ab cd text ab
doc 3: text

The Query: "text ab" unique

Explain output of the first (top scoring) hit (doc 0):

1.3350155 = (MATCH) sum of:
  0.7981777 = (MATCH) weight(content:"text ab" in 0) [DefaultSimilarity], result of:
    0.7981777 = score(doc=0,freq=1.0 = phraseFreq=1.0
), product of:
      0.7732263 = queryWeight, product of:
        2.0645385 = idf(), sum of:
          0.7768564 = idf(docFreq=4, maxDocs=4)
          1.287682 = idf(docFreq=2, maxDocs=4)
        0.37452745 = queryNorm
      1.0322692 = fieldWeight in 0, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = phraseFreq=1.0
        2.0645385 = idf(), sum of:
          0.7768564 = idf(docFreq=4, maxDocs=4)
          1.287682 = idf(docFreq=2, maxDocs=4)
        0.5 = fieldNorm(doc=0)
  0.5368378 = (MATCH) weight(content:unique in 0) [DefaultSimilarity], result of:
    0.5368378 = score(doc=0,freq=1.0 = termFreq=1.0
), product of:
      0.6341301 = queryWeight, product of:
        1.6931472 = idf(docFreq=1, maxDocs=4)
        0.37452745 = queryNorm
      0.8465736 = fieldWeight in 0, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        1.6931472 = idf(docFreq=1, maxDocs=4)
        0.5 = fieldNorm(doc=0)

Note that the first half, which scores the "text ab" portion of the query, uses very much the same algorithm as the second half (which scores unique), except for the added summation in the phrase idf calculation.
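To make that parallel concrete, here is a rough Java sketch that recomputes the doc 0 score with one shared function for both clauses. It is only an illustration under assumptions: the names are hypothetical, the formulas (idf = 1 + ln(maxDocs / (docFreq + 1)), tf = sqrt(freq), queryNorm = 1 / sqrt(sum of squared idfs)) are inferred from the numbers in the explain output, and fieldNorm is simply copied from that output rather than derived.

```java
// Hypothetical sketch, not part of the Lucene API.
class Doc0ScoreSketch {
    static double idf(int docFreq, int maxDocs) {
        return 1.0 + Math.log((double) maxDocs / (docFreq + 1));
    }

    static double tf(double freq) {
        return Math.sqrt(freq);
    }

    // Same function for a term clause and a phrase clause:
    // score = queryWeight * fieldWeight.
    static double clauseScore(double idf, double queryNorm,
                              double freq, double fieldNorm) {
        double queryWeight = idf * queryNorm;
        double fieldWeight = tf(freq) * idf * fieldNorm;
        return queryWeight * fieldWeight;
    }

    static double scoreDoc0() {
        int maxDocs = 4;
        // Phrase clause: the only difference is that idf is a sum.
        double phraseIdf = idf(4, maxDocs) + idf(2, maxDocs);
        // Term clause: a single idf.
        double uniqueIdf = idf(1, maxDocs);
        // Assumed normalization over both clauses of the query.
        double queryNorm =
            1.0 / Math.sqrt(phraseIdf * phraseIdf + uniqueIdf * uniqueIdf);
        double fieldNorm = 0.5; // copied from the explain output for doc 0
        return clauseScore(phraseIdf, queryNorm, 1.0, fieldNorm)
             + clauseScore(uniqueIdf, queryNorm, 1.0, fieldNorm);
    }

    public static void main(String[] args) {
        // Should land close to the 1.3350155 reported for doc 0.
        System.out.println("doc 0 score ~ " + scoreDoc0());
    }
}
```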

Explain output of the second hit (for good measure) (doc 2):

0.49384725 = (MATCH) product of:
  0.9876945 = (MATCH) sum of:
    0.9876945 = (MATCH) weight(content:"text ab" in 2) [DefaultSimilarity], result of:
      0.9876945 = score(doc=2,freq=2.0 = phraseFreq=2.0
), product of:
        0.7732263 = queryWeight, product of:
          2.0645385 = idf(), sum of:
            0.7768564 = idf(docFreq=4, maxDocs=4)
            1.287682 = idf(docFreq=2, maxDocs=4)
          0.37452745 = queryNorm
        1.277368 = fieldWeight in 2, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = phraseFreq=2.0
          2.0645385 = idf(), sum of:
            0.7768564 = idf(docFreq=4, maxDocs=4)
            1.287682 = idf(docFreq=2, maxDocs=4)
          0.4375 = fieldNorm(doc=2)
  0.5 = coord(1/2)
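Doc 2 matches the phrase twice but still ranks below doc 0, because it does not contain unique: only one of the two query clauses matches, so the sum is scaled by coord(1/2). The arithmetic of the tree above can be replayed with a small Java sketch. The constants are copied directly from the explain output, and the class and method names are hypothetical rather than Lucene API.

```java
// Hypothetical sketch, not part of the Lucene API.
class Doc2ScoreSketch {
    static double scoreDoc2() {
        // Values copied from the explain output for doc 2.
        double phraseIdf = 2.0645385;  // idf(text) + idf(ab)
        double queryNorm = 0.37452745;
        double fieldNorm = 0.4375;
        double phraseFreq = 2.0;       // "text ab" occurs twice in doc 2

        double queryWeight = phraseIdf * queryNorm;
        // tf is sqrt(freq) here, so two matches give about 1.414, not 2.
        double fieldWeight = Math.sqrt(phraseFreq) * phraseIdf * fieldNorm;
        double phraseScore = queryWeight * fieldWeight;

        // Only 1 of the 2 query clauses matched, hence coord = 1/2.
        double coord = 1.0 / 2.0;
        return phraseScore * coord;
    }

    public static void main(String[] args) {
        // Should land close to the 0.49384725 reported for doc 2.
        System.out.println("doc 2 score ~ " + scoreDoc2());
    }
}
```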
