Reputation: 19
We have a lot of documents in Solr, and a certain type of them tends to score too highly in results (apparently mainly because they are generally quite short). So if I search for a name, it will always return a load of short documents before anything longer.
How can I weight results so that the length of the document is taken more into account when ranking for relevance?
If it helps (as a kludge), we have a flag set on the documents this generally applies to, so if it is possible to boost all documents that don't have this flag set, that would be a temporary option for us.
Upvotes: 0
Views: 649
Reputation: 52769
Check the source of DefaultSimilarity for 4.0:
    @Override
    public void computeNorm(FieldInvertState state, Norm norm) {
        final int numTerms;
        if (discountOverlaps)
            numTerms = state.getLength() - state.getNumOverlap();
        else
            numTerms = state.getLength();
        norm.setByte(encodeNormValue(state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)))));
    }
So numTerms has an adverse impact on scoring: the norm is proportional to 1/sqrt(numTerms), so a field with more terms gets a lower norm.
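To make the size of that effect concrete, here is a minimal stand-alone sketch of the same arithmetic. It does not use Lucene; the class name LengthNormDemo and the sample field lengths are assumptions chosen for illustration, and the byte encoding done by encodeNormValue is left out, so real stored norms will be coarser than these values.

```java
// Stand-alone illustration; it does not use Lucene itself.
final class LengthNormDemo {
    // Same arithmetic as computeNorm above, minus the lossy byte encoding.
    static float lengthNorm(float boost, int numTerms) {
        return boost * (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        // Hypothetical field lengths: a very short, a medium and a long document.
        int[] lengths = {4, 100, 10000};
        for (int numTerms : lengths) {
            System.out.printf("numTerms=%d norm=%.4f%n", numTerms, lengthNorm(1.0f, numTerms));
        }
    }
}
```

With a boost of 1.0, a 4-term field gets a norm of 0.5 while a 10,000-term field gets 0.01, which is why short documents with the same matching terms tend to float to the top.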
You can create a custom class that overrides this behaviour, dropping the
    ((float) (1.0 / Math.sqrt(numTerms)))
factor to eliminate the lengthNorm effect.
Upvotes: 0
Reputation: 33341
This is caused by the lengthNorm in scoring. Longer documents with the same matching terms receive a somewhat lower score than short documents. See TFIDFSimilarity's documentation (scroll down to "6. norm(t,d)"), as well as the Solr documentation here.
This tends to work well for full-text search applications, the idea being that the document with the higher proportion of its content matching the query is more relevant to the query.
For instance, if I search Wikipedia article titles for the term Monkey, the relevance of the articles found might be:
If it's really necessary, this can be overridden in a custom subclass of DefaultSimilarity, by overriding computeNorm(state, norm) so that it simply uses state.getBoost() and ignores the field length.
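The shape of such an override can be sketched without Lucene on the classpath. The types below (DefaultNorm, BoostOnlyNorm) are hypothetical stand-ins, not Lucene classes; a real implementation would extend DefaultSimilarity and override computeNorm as described above.

```java
// Stand-in types for illustration only; these are NOT Lucene classes.
final class NoLengthNormDemo {
    // Mimics the default behaviour: boost scaled by 1/sqrt(numTerms).
    static class DefaultNorm {
        float norm(float boost, int numTerms) {
            return boost * (float) (1.0 / Math.sqrt(numTerms));
        }
    }

    // Mimics the suggested override: keep the boost, ignore the field length.
    static class BoostOnlyNorm extends DefaultNorm {
        @Override
        float norm(float boost, int numTerms) {
            return boost;
        }
    }

    public static void main(String[] args) {
        DefaultNorm byLength = new DefaultNorm();
        DefaultNorm boostOnly = new BoostOnlyNorm();
        // Hypothetical short and long field lengths.
        for (int numTerms : new int[] {4, 400}) {
            System.out.printf("numTerms=%d default=%.3f boostOnly=%.3f%n",
                    numTerms, byLength.norm(1.0f, numTerms), boostOnly.norm(1.0f, numTerms));
        }
    }
}
```

With the boost-only variant, short and long fields end up with the same norm, so length no longer favours the short documents. Since norms are computed at index time, a change like this would presumably only take effect after the documents are reindexed.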
Upvotes: 1