Reputation: 19

SOLR relevance seems tied heavily to length of document indexed

We have a lot of documents in SOLR, and a certain type of them tends to score too highly in results (it appears mainly due to them generally being quite short in content). So if I search for a name, it will always return a load of short documents before anything longer.

How can I weight results so that the length of the document is taken more into account when ranking for relevance?

If it helps (as a kludge), we have a flag set on the documents this generally applies to, so if it is possible to boost all documents that don't have this flag set, that would be a temporary option for us.

Upvotes: 0

Answers (2)

Jayendra

Reputation: 52809

Check the source of DefaultSimilarity for 4.0:

@Override
public void computeNorm(FieldInvertState state, Norm norm) {
    final int numTerms;
    if (discountOverlaps)
        numTerms = state.getLength() - state.getNumOverlap();
    else
        numTerms = state.getLength();
    norm.setByte(encodeNormValue(state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)))));
}

So the norm shrinks as numTerms grows: the more terms a field has, the lower its score.
You can create a custom Similarity class that overrides this behaviour in one of these ways:

Set numTerms to 1, so every document gets the same length factor
Change the calculation so that longer documents score higher, the inverse of the current behaviour
Remove the factor ((float) (1.0 / Math.sqrt(numTerms))) to eliminate the lengthNorm effect
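To get a feel for how strong the effect is, here is a minimal standalone sketch (plain Java, no Lucene dependency) that evaluates the same expression as the quoted computeNorm for a few field lengths, next to two alternatives. The class and method names are hypothetical, the logarithmic "favour longer fields" variant is just one arbitrary choice, and the sketch ignores the lossy single-byte encoding done by encodeNormValue.

```java
public class LengthNormDemo {

    /** Same expression as in the quoted computeNorm: boost * 1/sqrt(numTerms). */
    static float defaultNorm(float boost, int numTerms) {
        return boost * (float) (1.0 / Math.sqrt(numTerms));
    }

    /** Length factor removed (equivalent to fixing numTerms at 1): the norm is just the boost. */
    static float flatNorm(float boost, int numTerms) {
        return boost;
    }

    /** Hypothetical variant that favours longer fields; log growth is an assumption, not Lucene behaviour. */
    static float longFavouringNorm(float boost, int numTerms) {
        return boost * (float) (1.0 + Math.log(numTerms));
    }

    public static void main(String[] args) {
        int[] lengths = {1, 4, 25, 100, 400};
        for (int n : lengths) {
            System.out.printf("numTerms=%4d default=%.3f flat=%.3f longFavouring=%.3f%n",
                    n, defaultNorm(1.0f, n), flatNorm(1.0f, n), longFavouringNorm(1.0f, n));
        }
    }
}
```

With the default expression, a 4-term field gets a factor of 0.5 while a 100-term field gets 0.1, which is why short documents float to the top. Norms are computed at index time, so whichever variant goes into a real Similarity subclass, the documents would need to be reindexed for it to take effect.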

Upvotes: 0

femtoRgon

Reputation: 33351

This is caused by the lengthNorm in scoring. Longer documents with the same matching terms receive a somewhat lower score than short documents. See TFIDFSimilarity's documentation (scroll down to "6. norm(t,d)") and the Solr documentation.

This tends to work well for full-text searching applications. The idea is that the document with the higher proportion of its content matching the query is more relevant to the query.

For instance, if I search Wikipedia article titles for the term Monkey, the relevance of the articles found might be:

Monkey - Precise match, it would be reasonable to assume this is what I was looking for
Spider Monkey - A well-known type of monkey, still quite relevant
Monkey: Journey to the West - A stage play featuring a main character who is a monkey. Likely less relevant.
African green monkey lymphotropic polyomavirus - A human tumor virus. Relevance to query limited.
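As a rough illustration of the ordering above, the sketch below computes a 1/sqrt(numTerms) factor for each title. It is plain Java rather than Lucene code: the tokenizer and the small stop-word list are assumptions standing in for whatever analyzer the field actually uses, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class TitleLengthNorm {

    // Hypothetical stop-word list; a real analyzer chain may keep or drop different words.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("to", "the", "a", "of"));

    /** Counts terms after lowercasing, splitting on non-letters, and dropping stop words. */
    static int countTerms(String title) {
        int count = 0;
        for (String token : title.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                count++;
            }
        }
        return count;
    }

    /** Length factor of the form 1/sqrt(numTerms), with no boost applied. */
    static double lengthNorm(String title) {
        return 1.0 / Math.sqrt(countTerms(title));
    }

    public static void main(String[] args) {
        String[] titles = {
            "Monkey",
            "Spider Monkey",
            "Monkey: Journey to the West",
            "African green monkey lymphotropic polyomavirus"
        };
        for (String title : titles) {
            System.out.printf("%-48s terms=%d norm=%.3f%n",
                    title, countTerms(title), lengthNorm(title));
        }
    }
}
```

Under these assumptions the titles have 1, 2, 3 and 5 terms, giving factors of roughly 1.0, 0.71, 0.58 and 0.45. Every title matches "monkey" once, so the length factor alone reproduces the ranking in the list.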

If it's really necessary, this can be changed with a custom Similarity that extends DefaultSimilarity and overrides computeNorm(state, norm) so that the stored norm is simply state.getBoost(), ignoring the field length.

Upvotes: 1
