Reputation: 359
Using the length norm at indexing time works well for me in general; my problem is that very short fields rank inappropriately high. Example:
doc1 : tf(200) out of 1,000
doc2 : tf(150) out of 500
doc2 will score higher, which is great.
Problem is when I have:
doc3 : tf(3) out of 4
which is not great in my case because it's a very rare document, let's say an exception.
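To make the problem concrete, here is a rough back-of-the-envelope check, assuming Lucene's defaults of tf = sqrt(freq) and lengthNorm = 1 / sqrt(numTerms), and ignoring idf and boosts (the class name NormSketch is just for illustration):

public class NormSketch {
    public static void main(String[] args) {
        // score contribution ~ sqrt(freq) / sqrt(fieldLength), ignoring idf and boosts
        System.out.println(score(200, 1000)); // doc1: ~0.45
        System.out.println(score(150, 500));  // doc2: ~0.55 (beats doc1, as desired)
        System.out.println(score(3, 4));      // doc3: ~0.87 (beats both, which is the problem)
    }

    static double score(int freq, int fieldLength) {
        return Math.sqrt(freq) / Math.sqrt(fieldLength);
    }
}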
I've read a suggestion (from KinoSearch, or somewhere similar) to introduce a constant to offset this issue. Any ideas on how I can still leverage the full power of the length norm while avoiding this problem?
Thanks
Upvotes: 0
Views: 254
Reputation: 33351
You can create your own Similarity class, extending DefaultSimilarity, and simply override the lengthNorm method. The default lengthNorm implementation is pretty simple, really:
public float lengthNorm(FieldInvertState state) {
    final int numTerms;
    if (discountOverlaps)
        numTerms = state.getLength() - state.getNumOverlap();
    else
        numTerms = state.getLength();
    return state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));
}
Replace it with whatever algorithm makes sense in your case. Really, the last line is probably all you need to modify, particularly the 1.0 / Math.sqrt(numTerms) part. Keep in mind:
You can set Solr to use your Similarity in your schema, like:
<similarity class="this.is.my.CustomSimilarity"/>
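Here is a minimal sketch of such an override, assuming the Lucene 4.x API (the class name MyShortFieldSimilarity and the PIVOT constant are illustrative, not part of the original answer). It adds a constant to the term count, as the question suggests, so that tiny fields no longer dominate:

import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.DefaultSimilarity;

public class MyShortFieldSimilarity extends DefaultSimilarity {

    // Illustrative constant; tune it against your own data. Adding it to the
    // term count flattens the norm for very short fields while leaving long
    // fields almost unchanged.
    private static final int PIVOT = 100;

    @Override
    public float lengthNorm(FieldInvertState state) {
        final int numTerms;
        if (discountOverlaps)
            numTerms = state.getLength() - state.getNumOverlap();
        else
            numTerms = state.getLength();
        // Default was 1.0 / sqrt(numTerms); PIVOT caps the advantage a
        // 4-term field has over a 1,000-term one.
        return state.getBoost() * (float) (1.0 / Math.sqrt(numTerms + PIVOT));
    }
}

With PIVOT = 100, a 4-term field gets a norm of about 0.098 instead of 0.5, while a 1,000-term field only drops from about 0.032 to 0.030. Note that norms are written at index time, so you will need to reindex for the change to take effect, and register the class in your schema (with the matching class name) as shown above.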
Upvotes: 2