Reputation: 23
I am trying to change the scoring in Apache Lucene 5.3, and my formula needs the document length (the number of tokens in the document). From answers to similar questions, I understand there is no easy way to get it, because Lucene doesn't keep it in the index. So I thought that while indexing I could build a Map from docID to document length and then use it during query evaluation. But I have no idea where I should put this map or where I should update it.
Upvotes: 1
Views: 189
Reputation: 33351
You are exactly right: storing this when the document is indexed is the best approach. The place to store it is in the norm (not to be confused with the queryNorm, which is something different). A norm is a single value stored per document for each field, computed at index time and made available at query time for scoring.
In your Similarity implementation, this should go into the computeNorm method, which exposes the information you need through the FieldInvertState, particularly FieldInvertState.getLength(). Norms are made available at search time through LeafReader.getNormValues.
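To make the round trip concrete, here is a minimal, Lucene-free sketch of the idea: compute the length once at index time, store it as the norm, and read it back when scoring. The class DocLengthNormSketch, its Map-backed storage, the whitespace tokenizer, and the scoring formula are all hypothetical stand-ins; in Lucene the value would be returned from computeNorm and read back through the norms, not kept in your own map.

```java
import java.util.HashMap;
import java.util.Map;

/** Lucene-free model of the norm round trip (all names here are illustrative). */
public class DocLengthNormSketch {
    // Stand-in for the index's per-document norm storage for one field.
    private final Map<Integer, Long> norms = new HashMap<>();

    // Plays the role of Similarity.computeNorm: called once per document at index time.
    // tokenCount stands in for FieldInvertState.getLength().
    static long computeNorm(int tokenCount) {
        return tokenCount; // store the raw length, no lossy encoding
    }

    // Index-time path: count tokens (naive whitespace split) and record the norm.
    void indexDocument(int docId, String text) {
        String trimmed = text.trim();
        int tokenCount = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        norms.put(docId, computeNorm(tokenCount));
    }

    // Search-time path: plays the role of looking up the norm value for a document.
    long docLength(int docId) {
        return norms.getOrDefault(docId, 0L);
    }

    // Hypothetical formula: term frequency normalised by document length.
    double score(int docId, int termFreq) {
        long length = docLength(docId);
        return length == 0 ? 0.0 : (double) termFreq / length;
    }
}
```

The point of the sketch is only the division of labour: the length is known at index time, so that is where it gets written, and the scorer only ever reads it.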
If you are extending TFIDFSimilarity instead, you just need to implement the encodeNormValue, decodeNormValue and lengthNorm methods.
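As a rough illustration of how those three methods fit together, the sketch below models them as plain static functions outside Lucene. It assumes the norm is stored as a long, so the float can be kept bit-for-bit instead of being squeezed into a single byte (which, as far as I recall, is what the stock similarity does). The class name LosslessNormCodecSketch and the 1/sqrt(length) formula are illustrative choices, and the real lengthNorm receives a FieldInvertState, not an int.

```java
/** Lucene-free model of the lengthNorm / encodeNormValue / decodeNormValue trio. */
public class LosslessNormCodecSketch {
    // Role of lengthNorm: turn the token count into the factor the formula needs.
    // Here: the classic 1/sqrt(length); swap in whatever your formula requires.
    static float lengthNorm(int tokenCount) {
        return tokenCount == 0 ? 0f : (float) (1.0 / Math.sqrt(tokenCount));
    }

    // Role of encodeNormValue: pack the float into the long that gets stored.
    static long encodeNormValue(float f) {
        return Float.floatToIntBits(f);
    }

    // Role of decodeNormValue: recover the exact float at search time.
    static float decodeNormValue(long norm) {
        return Float.intBitsToFloat((int) norm);
    }
}
```

Because the encoding keeps all 32 bits, decodeNormValue(encodeNormValue(x)) returns x unchanged; the trade-off is more storage per document than a one-byte norm.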
Upvotes: 1