What is the reasoning behind the ranking of this ElasticSearch query?

Question

I have two documents:

{
    id: 7,
    title: 'Wet',
    description: 'asdfasdfasdf'
}

{
    id: 6,
    title: 'Wet wet',
    description: 'asdfasdfasdf'
}

They are almost identical except for the extra word in the 2nd document.

My query is this:

var qobject = {
        query:{
            custom_score:{
                query:{
                   multi_match:{
                     query: q, //I searched for "wet"
                     fields: ['title','description'],
                   }
                },
                script: '_score '
            }
        }
    }

OK, so when I run this query I get these results:

{ total: 2,
  max_score: 1.8472979,
  hits: 
   [ { _index: 'products',
       _type: 'products',
       _id: '7',
       _score: 1.9808292,
       _source: [Object] },
     { _index: 'products',
       _type: 'products',
       _id: '6',
       _score:  1.7508222,
       _source: [Object] } ] }

How come id 7 is ranked higher than id 6? What is the reasoning behind the score? Shouldn't 6 be ranked higher because it has 2 words?

What if I want more matching words to carry more weight? How should I change my query to get that?

Explain is below:

"_explanation": {
            "value": 1.9808292,
            "description": "custom score, product of:",
            "details": [
                {
                    "value": 1.9808292,
                    "description": "script score function: composed of:",
                    "details": [
                        {
                            "value": 1.9808292,
                            "description": "fieldWeight(title:wet in 0), product of:",
                            "details": [
                                {
                                    "value": 1,
                                    "description": "tf(termFreq(title:wet)=1)"
                                },
                                {
                                    "value": 1.9808292,
                                    "description": "idf(docFreq=2, maxDocs=8)"
                                },
                                {
                                    "value": 1,
                                    "description": "fieldNorm(field=title, doc=0)"
                                }
                            ]
                        }
                    ]
                },
                {
                    "value": 1,
                    "description": "queryBoost"
                }
            ]
        }

"_explanation": {
            "value": 1.7508222,
            "description": "custom score, product of:",
            "details": [
                {
                    "value": 1.7508222,
                    "description": "script score function: composed of:",
                    "details": [
                        {
                            "value": 1.7508222,
                            "description": "fieldWeight(title:wet in 0), product of:",
                            "details": [
                                {
                                    "value": 1.4142135,
                                    "description": "tf(termFreq(title:wet)=2)"
                                },
                                {
                                    "value": 1.9808292,
                                    "description": "idf(docFreq=2, maxDocs=8)"
                                },
                                {
                                    "value": 0.625,
                                    "description": "fieldNorm(field=title, doc=0)"
                                }
                            ]
                        }
                    ]
                },
                {
                    "value": 1,
                    "description": "queryBoost"
                }
            ]
        }

javanna · Accepted Answer

Have a look at the explain output of your query to see why. You can either use the explain API or add "explain": true to your current search request.
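
As a minimal sketch, the request body from the question with explain switched on could look like the following. The query itself is copied from the question; the only addition is the top-level explain flag, and q is assumed to hold the search term "wet".

```javascript
// Search term used in the question.
var q = 'wet';

// Same request body as in the question, with "explain" added at the top
// level so that every hit comes back with an _explanation tree.
var qobject = {
    explain: true,
    query: {
        custom_score: {
            query: {
                multi_match: {
                    query: q,
                    fields: ['title', 'description']
                }
            },
            script: '_score'
        }
    }
};

console.log(JSON.stringify(qobject, null, 2));
```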

By default Lucene uses the TF/IDF (term frequency, inverse document frequency) similarity to score documents. Several factors are taken into account for each term that matches the query. The following are the most important ones:

Term frequency: how frequent the term is within the document. The more the better. If a term appears multiple times the document is a better match.
Inverse document frequency: how frequent the term is across the index. The less the better. Rare terms win against common terms.
Norms: index-time boost (default 1, no boost) combined with the field norm, which favors shorter fields over longer ones.
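
To make these factors concrete, the sketch below recomputes both scores from the numbers in the explain output of the question. It assumes Lucene's classic TF/IDF formulas, tf = sqrt(termFreq) and idf = 1 + ln(maxDocs / (docFreq + 1)). The fieldNorm values are copied from the explain output rather than derived, because Lucene stores norms in a lossy one-byte encoding (which is presumably why the two-term title shows 0.625 instead of 1/sqrt(2) ≈ 0.707). The function names are made up for illustration.

```javascript
// Classic Lucene TF/IDF building blocks (assumed formulas, see above).
function tf(termFreq) {
    return Math.sqrt(termFreq);
}

function idf(docFreq, maxDocs) {
    return 1 + Math.log(maxDocs / (docFreq + 1));
}

// fieldWeight = tf * idf * fieldNorm, the product shown in the explain output.
function fieldWeight(termFreq, docFreq, maxDocs, fieldNorm) {
    return tf(termFreq) * idf(docFreq, maxDocs) * fieldNorm;
}

// Document 7: title "Wet"     -> termFreq 1, fieldNorm 1
// Document 6: title "Wet wet" -> termFreq 2, fieldNorm 0.625
var score7 = fieldWeight(1, 2, 8, 1);
var score6 = fieldWeight(2, 2, 8, 0.625);

console.log(score7.toFixed(4)); // 1.9808
console.log(score6.toFixed(4)); // 1.7508
```

The term frequency factor only rises from 1 to about 1.414 for the second "wet", while the field norm drops from 1 to 0.625, so the net effect of the extra word is a lower score.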

Based on the query you're executing, the documents are scored differently because of norms. In your example the second document is scored lower because of its lower field norm (0.625 instead of 1), despite its higher term frequency (2 instead of 1). You can disable norms in your mapping (and reindex), but that way you'll also lose index-time boosting (which I don't think you're using anyway).
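
A mapping with norms disabled might look roughly like the sketch below. It assumes an Elasticsearch version from the same era as the custom_score query, where string fields accept an omit_norms option; later versions renamed this setting, so check the mapping documentation for the version you run. The type name products is taken from the search results in the question.

```javascript
// Hypothetical mapping for the "products" type with norms switched off
// on both searched fields. Requires reindexing to take effect.
var mapping = {
    products: {
        properties: {
            title: { type: 'string', omit_norms: true },
            description: { type: 'string', omit_norms: true }
        }
    }
};

console.log(JSON.stringify(mapping, null, 2));
```

With norms omitted, the field norm should be 1 for both documents, so only tf and idf remain: document 6 would then score about 1.414 * 1.981 ≈ 2.80 against 1.98 for document 7.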

Another solution would be to plug in a different Lucene similarity: Lucene 4 provides more similarities and also allows the similarity to be defined per field. Those features have been exposed in Elasticsearch 0.90.
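
As a rough sketch of the per-field approach, the index settings below define a custom similarity and attach it to the title field. This is an assumption-laden example, not something taken from the question: it picks BM25 with b set to 0, which switches off BM25's length normalization, and the similarity name no_length_norm is invented for illustration. Verify the exact settings layout against the similarity module documentation for your version.

```javascript
// Hypothetical index creation body: a named BM25 similarity without
// length normalization (b = 0), applied to the "title" field only.
var indexBody = {
    settings: {
        index: {
            similarity: {
                no_length_norm: { type: 'BM25', b: 0 }
            }
        }
    },
    mappings: {
        products: {
            properties: {
                title: { type: 'string', similarity: 'no_length_norm' }
            }
        }
    }
};

console.log(JSON.stringify(indexBody, null, 2));
```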
