Steve P.

Reputation: 14699

Elasticsearch 7.10: How to give more weight to terms that appear earlier in a document

Let's say we issue a query (the exact type is not relevant) for the term "cosmopolitan" on a particular field, and let's assume that the result set contains several documents, each of which contains exactly 'k' instances of "cosmopolitan".

By whatever mechanism is applicable (boosting, weighting, sorting, etc.), I'd like the result set returned such that the positions of "cosmopolitan" within the documents are taken into account, i.e. if the average position of "cosmopolitan" is lower (closer to the start of the doc), then the document's rank/score is higher.

I've looked into different sorts of queries and scripting, but can't seem to find something that applies to this, which seems odd since for many domains the term position can be really important.

Upvotes: 1

Views: 277

Answers (1)

Joe - Check out my books

Reputation: 16905

If we're talking about exact substrings of an arbitrary myfield, we can use a sorting script which subtracts the index of first occurrence from the whole string length, thereby boosting earlier occurrences:

{
  "query": { ... },
  "sort": [
    {
      "_script": {
        "script": {
          "params": {
            "substr_value": "cosmopolitan"
          },
          "source": """
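            // Original (unanalyzed) field value from the keyword sub-field
            def fieldval = doc['myfield.keyword'].value;
            // Character offset of the first occurrence, or -1 if absent
            def indexof = fieldval.indexOf(params.substr_value);
            // The earlier the occurrence, the larger the value added to _score
            return indexof == -1 ? _score : _score + (fieldval.length() - indexof);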
          """
        },
        "type": "number",
        "order": "desc"
      }
    }
  ]
}

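The script needs access to the original, unanalyzed value of myfield, which is why it reads the .keyword sub-field. Setting fielddata: true on the text field is not an equivalent substitute as far as I can tell, because fielddata exposes the analyzed tokens rather than the original string, so indexOf would no longer return an offset into the whole text.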


Alternatively, a function score query is a great fit here:

{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "myfield": "cosmopolitan"
        }
      },
      "script_score": {
        "script": {
          "params": {
            "substr_value": "cosmopolitan"
          },
          "source": """
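            def fieldval = doc['myfield.keyword'].value;
            def indexof = fieldval.indexOf(params.substr_value);
            // boost_mode "sum" already adds the query's _score to this value,
            // so return 0 (not _score) when the substring is absent
            return indexof == -1 ? 0 : (fieldval.length() - indexof);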
          """
        }
      },
      "boost_mode": "sum"
    }
  }
}

You can tweak its parameters, such as boost_mode and weight, to suit your needs.

Also, you'll probably want to compute some weighted average over all the substring occurrences rather than just the first one, and you can do so within those scripts.
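To make the weighted-average idea concrete, here is a minimal sketch of the arithmetic in Python, not the Painless script itself. The function names, the 1/(rank + 1) weights, and the normalization by string length are all assumptions chosen for illustration; the original scripts use the raw `length - indexOf` value instead.

```python
def occurrence_positions(text, term):
    """Return the start index of every non-overlapping occurrence of term."""
    if not term:
        return []  # an empty term would otherwise match at every step
    positions = []
    start = text.find(term)
    while start != -1:
        positions.append(start)
        start = text.find(term, start + len(term))
    return positions


def position_boost(text, term):
    """Return a boost in [0, 1] that is higher when occurrences sit
    closer to the start of text. Returns 0.0 if term does not occur."""
    positions = occurrence_positions(text, term)
    if not positions:
        return 0.0
    # Hypothetical weighting: the first occurrence counts most (1, 1/2, 1/3, ...).
    weights = [1.0 / (rank + 1) for rank in range(len(positions))]
    weighted_avg = sum(w * p for w, p in zip(weights, positions)) / sum(weights)
    # Normalize by length so long documents are not favored automatically.
    return 1.0 - weighted_avg / len(text)


if __name__ == "__main__":
    early = "cosmopolitan" + " filler" * 5
    late = "filler " * 5 + "cosmopolitan"
    print(position_boost(early, "cosmopolitan"))
    print(position_boost(late, "cosmopolitan"))
```

The same loop should port to Painless fairly directly, since it only relies on repeated indexOf calls with a start offset; treat that port as untested.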

Upvotes: 2
