IGx89

Reputation: 902

Scoring by term position in ElasticSearch?

I'm implementing an auto-complete index in ElasticSearch and have run into an issue with sorting/scoring. Say I have the following strings in an index:

apple banana coconut donut
apple banana donut durian
apple donut coconut durian
donut banana coconut durian

When I search for "donut", I want the results to be ordered by the term location like so:

donut banana coconut durian
apple donut coconut durian
apple banana donut durian
apple banana coconut donut

I can't figure out how to make that happen. Term position isn't factored into the default scoring logic, and I can't find a way to get it in there. It seems like a simple enough issue that others must have run into it before. Has anyone figured it out yet?

Thanks!

Upvotes: 16

Views: 6230

Answers (2)

IGx89

Reputation: 902

Here's the solution I ended up with, based on Andrei's answer and expanded to support multiple search terms and additional scoring based on length of the first word in the result:

First, define the following custom analyzer (it keeps the entire string as a single token and lowercases it):

"raw_analyzer": {
    "type": "custom",
    "filter": [
        "lowercase"
    ],
    "tokenizer": "keyword"
}
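To make the effect of this analyzer concrete, here is a rough Python imitation. It is not Elasticsearch code, and raw_analyze and matches_prefix are names made up for illustration. Because the whole string becomes one token, a prefix test only succeeds when the string itself starts with the prefix:

```python
def raw_analyze(text):
    """Mimic a keyword tokenizer plus lowercase filter: one lowercased token."""
    return [text.lower()]


def matches_prefix(text, prefix):
    """True if the single raw token starts with the lowercased prefix."""
    return any(token.startswith(prefix.lower()) for token in raw_analyze(text))


print(raw_analyze("Apple Banana"))              # ['apple banana']
print(matches_prefix("Apple Banana", "apple"))  # True
print(matches_prefix("Banana Apple", "apple"))  # False
```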

Second, define your search field mapping like so (mine's named "name"):

"name": {
    "type": "string",
    "analyzer": "english",
    "fields": {
        "raw": {
            "type": "string",
            "index_analyzer": "raw_analyzer",
            "search_analyzer": "standard"
        }
    }
},
"_nameFirstWordLength": {
    "type": "long"
}

Third, when populating the index, compute _nameFirstWordLength with logic like the following (mine's in C#):

_nameFirstWordLength = fi.Name.Split(new[] {' '}, StringSplitOptions.RemoveEmptyEntries)[0].Length
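For anyone not using C#, roughly the same computation in Python might look like the sketch below. The function name first_word_length is illustrative only, and like the C# line it raises an error if the name is empty or contains only spaces:

```python
def first_word_length(name):
    """Length of the first space-separated word, ignoring empty pieces."""
    # Splitting on a single space and dropping empty strings mirrors
    # Split(new[] {' '}, StringSplitOptions.RemoveEmptyEntries) in the C# above.
    words = [word for word in name.split(" ") if word]
    return len(words[0])


print(first_word_length("Apple Store"))       # 5
print(first_word_length("Applebee's Grill"))  # 10
```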

Finally, do your search as follows:

{
   "query":{
      "bool":{
         "must":{
            "match_phrase_prefix":{
               "name":{
                  "query":"apple"
               }
            }
         },
         "should":{
            "function_score":{
               "query":{
                  "query_string":{
                     "fields":[
                        "name.raw"
                     ],
                     "query":"apple*"
                  }
               },
               "script_score":{
                  "script":"100/doc['_nameFirstWordLength'].value"
               },
               "boost_mode":"replace"
            }
         }
      }
   }
}

I'm using match_phrase_prefix so that partial matches are supported, such as "ap" matching "apple". The bool must/should with that second query_string query against name.raw gives a higher score to results whose name starts with one of the search terms (in my code I'm pre-processing the search string, just for that second query, to add a "*" after every word). Finally, wrapping that second query in a function_score script that uses the value of _nameFirstWordLength causes the results up-scored by the second query to be further sorted by the length of their first word (causing Apple to show before Applebee's, for example).
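The pre-processing step is not shown above, so here is a minimal Python sketch of how it could be done. The helper names add_wildcards and build_query are hypothetical, and the request body is simply the one from this answer with the search string filled in:

```python
import json


def add_wildcards(search):
    """Append '*' to every whitespace-separated word in the search string."""
    return " ".join(word + "*" for word in search.split())


def build_query(search):
    """Assemble the request body from this answer for an arbitrary search string."""
    return {
        "query": {
            "bool": {
                # Partial matches on the analyzed field ("ap" matches "apple").
                "must": {"match_phrase_prefix": {"name": {"query": search}}},
                # Extra score for names that start with one of the terms.
                "should": {
                    "function_score": {
                        "query": {
                            "query_string": {
                                "fields": ["name.raw"],
                                "query": add_wildcards(search),
                            }
                        },
                        "script_score": {
                            "script": "100/doc['_nameFirstWordLength'].value"
                        },
                        "boost_mode": "replace",
                    }
                },
            }
        }
    }


print(add_wildcards("apple pie"))  # apple* pie*
print(json.dumps(build_query("apple pie"), indent=2))
```

With the script as written, a first word of length 5 (Apple) scores 100/5 = 20 and a first word of length 10 (Applebee's) scores 100/10 = 10, which is why the shorter name sorts first.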

Upvotes: 1

Andrei Stefan

Reputation: 52368

You can do custom sorting with a script, like this:

{
  "query": {
    "match": {
      "content": "donut"
    }
  },
  "sort": {
    "_script": {
      "script": "termInfo=_index['content'].get('donut',_OFFSETS);for(pos in termInfo){return _score+pos.startOffset};",
      "type": "number",
      "order": "asc"
    }
  }
}

The script there returns _score plus the startOffset of the first occurrence of the term. If you need something else, adjust how those values are combined with the original score until the ordering suits your needs.

Or you can do something like this:

{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "content": "donut"
        }
      },
      "script_score": {
        "script": "termInfo=_index['content'].get('donut',_OFFSETS);for(pos in termInfo){return pos.startOffset};"
      },
      "boost_mode": "replace"
    }
  },
  "sort": [
    {
      "_score": "asc"
    }
  ]
}

In either case, the mapping for that specific field needs to include this:

"content": {
  "type": "string",
  "index_options": "offsets"
}

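That is, index_options needs to be set to offsets so that the script can read each term's start offset. See the Elasticsearch mapping documentation on index_options for more details.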
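To see why sorting on the start offset in ascending order gives the ordering asked for in the question, here is a small Python sketch that imitates the idea outside Elasticsearch. It assumes simple whitespace tokenization and exact, case-insensitive term matching, and start_offsets is a made-up helper name, not an Elasticsearch API:

```python
import re


def start_offsets(text, term):
    """Character offsets at which whitespace-delimited tokens equal to term start."""
    return [
        match.start()
        for match in re.finditer(r"\S+", text)
        if match.group().lower() == term.lower()
    ]


docs = [
    "apple banana coconut donut",
    "apple banana donut durian",
    "apple donut coconut durian",
    "donut banana coconut durian",
]

# Like the scripts above, use the offset of the first occurrence as the sort key.
ranked = sorted(docs, key=lambda doc: start_offsets(doc, "donut")[0])
for doc in ranked:
    print(start_offsets(doc, "donut")[0], doc)
# 0 donut banana coconut durian
# 6 apple donut coconut durian
# 13 apple banana donut durian
# 21 apple banana coconut donut
```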

Upvotes: 6
