Reputation: 359
Using the length norm at indexing time works well for me in general; my problem is that very short fields rank inappropriately high. Example:
doc1 : tf(200) out of 1,000
doc2 : tf(150) out of 500
doc2 will score higher, which is great.
Problem is when I have:
doc3 : tf(3) out of 4
which is not great in my case because it's a very rare document, let's say an exception.
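To make the problem concrete, here is a rough back-of-the-envelope check, assuming Lucene's defaults of tf = sqrt(freq) and lengthNorm = 1 / sqrt(numTerms), and ignoring idf and boosts (the class name NormSketch is just for illustration):

public class NormSketch {
    public static void main(String[] args) {
        // score contribution ~ sqrt(freq) / sqrt(fieldLength), ignoring idf and boosts
        System.out.println(score(200, 1000)); // doc1: ~0.45
        System.out.println(score(150, 500));  // doc2: ~0.55 (beats doc1, as desired)
        System.out.println(score(3, 4));      // doc3: ~0.87 (beats both, which is the problem)
    }

    static double score(int freq, int fieldLength) {
        return Math.sqrt(freq) / Math.sqrt(fieldLength);
    }
}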
I've read a suggestion (from KinoSearch, or somewhere similar) to introduce a constant to offset this issue. Any ideas on how I can still leverage the full power of the length norm while avoiding this problem?
Thanks
Upvotes: 0
Views: 254
Reputation: 33351
You can create your own Similarity class, extending DefaultSimilarity, and simply override the lengthNorm method. The default lengthNorm implementation is pretty simple, really:
public float lengthNorm(FieldInvertState state) {
    final int numTerms;
    if (discountOverlaps)
        numTerms = state.getLength() - state.getNumOverlap();
    else
        numTerms = state.getLength();
    return state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));
}
Replace it with whatever algorithm makes sense in your case. Really, the last line is probably all you need to modify, particularly the 1.0 / Math.sqrt(numTerms) part. Keep in mind:
You can set Solr to use your Similarity in your schema, like:
<similarity class="this.is.my.CustomSimilarity"/>
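Here is a minimal sketch of such an override, assuming the Lucene 4.x API (the class name MyShortFieldSimilarity and the PIVOT constant are illustrative, not part of the original answer). It adds a constant to the term count, as the question suggests, so that tiny fields no longer dominate:

import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.DefaultSimilarity;

public class MyShortFieldSimilarity extends DefaultSimilarity {

    // Illustrative constant; tune it against your own data. Adding it to the
    // term count flattens the norm for very short fields while leaving long
    // fields almost unchanged.
    private static final int PIVOT = 100;

    @Override
    public float lengthNorm(FieldInvertState state) {
        final int numTerms;
        if (discountOverlaps)
            numTerms = state.getLength() - state.getNumOverlap();
        else
            numTerms = state.getLength();
        // Default was 1.0 / sqrt(numTerms); PIVOT caps the advantage a
        // 4-term field has over a 1,000-term one.
        return state.getBoost() * (float) (1.0 / Math.sqrt(numTerms + PIVOT));
    }
}

With PIVOT = 100, a 4-term field gets a norm of about 0.098 instead of 0.5, while a 1,000-term field only drops from about 0.032 to 0.030. Note that norms are written at index time, so you will need to reindex for the change to take effect, and register the class in your schema (with the matching class name) as shown above.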
Upvotes: 2