getting the document length during query evaluation apache lucene 5.3

Question

I am trying to change the scoring in apache lucene 5.3, and for my formula I need the document length (the number of tokens in the document). I understood from answers to similar question, you don't have an easy way to do it. because lucene doesn't keep it at the index. so I thought maybe while indexing I will create an Map from docID to the document length, and then use it in query evaluation. But, I have no idea where I should put this map and where I will update it.

femtoRgon · Accepted Answer

You are exactly right, storing this when the document is indexed is the best approach. The place to store it is in the norm (not to be confused with the queryNorm, that's something different). Norms provide a single value stored with the field, which is made available at query time for scoring.

In your Similarity implementation, this should go into the ComputeNorm method, which exposes the information you need through the FieldInvertState, particularly FieldInvertState.getLength(). Norms are made available at search time through LeafReader.GetNormValues.

If you are extending TFIDFSimilarity, instead, you just need to implement the encodeNormValue, decodeNormValue and lengthNorm methods.

getting the document length during query evaluation apache lucene 5.3

Answers (1)

Related Questions