Reputation: 902
I'm implementing an auto-complete index in ElasticSearch and have run into an issue with sorting/scoring. Say I have the following strings in an index:
apple banana coconut donut
apple banana donut durian
apple donut coconut durian
donut banana coconut durian
When I search for "donut", I want the results to be ordered by the term location like so:
donut banana coconut durian
apple donut coconut durian
apple banana donut durian
apple banana coconut donut
I can't figure out how to make that happen. Term position isn't factored into the default scoring logic, and I can't find a way to get it in there. Seems like a simple enough issue though that others must have run into this before. Has anyone figured it out yet?
Thanks!
Upvotes: 16
Views: 6230
Reputation: 902
Here's the solution I ended up with, based on Andrei's answer and expanded to support multiple search terms and additional scoring based on length of the first word in the result:
First, define the following custom analyzer (it keeps the entire string as a single token and lowercases it):
"raw_analyzer": {
  "type": "custom",
  "filter": [
    "lowercase"
  ],
  "tokenizer": "keyword"
}
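To make the effect of that analyzer concrete, here is a rough Python imitation of a keyword tokenizer followed by a lowercase filter. It is only an illustration of the behaviour described above, not Elasticsearch code, and the function name `raw_analyze` is made up for this sketch.

```python
from typing import List


def raw_analyze(text: str) -> List[str]:
    """Imitate a keyword tokenizer followed by a lowercase filter."""
    # The keyword tokenizer emits the whole input as one token;
    # the lowercase filter then lowercases that single token.
    return [text.lower()]


if __name__ == "__main__":
    print(raw_analyze("Apple Banana Donut"))
```

Because the whole name stays as one token, a prefix or wildcard match against name.raw only succeeds when the name itself starts with the search text.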
Second, define your search field mapping like so (mine's named "name"):
"name": {
  "type": "string",
  "analyzer": "english",
  "fields": {
    "raw": {
      "type": "string",
      "index_analyzer": "raw_analyzer",
      "search_analyzer": "standard"
    }
  }
},
"_nameFirstWordLength": {
  "type": "long"
}
Third, when populating the index, use the following logic (mine's in C#) to set _nameFirstWordLength:
_nameFirstWordLength = fi.Name.Split(new[] {' '}, StringSplitOptions.RemoveEmptyEntries)[0].Length
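For anyone not indexing from C#, a rough Python equivalent of that line might look like the following. The function name `first_word_length` is hypothetical; like the C# version, it assumes the name contains at least one non-space character and fails otherwise.

```python
def first_word_length(name: str) -> int:
    """Length of the first space-separated word in name."""
    # Split on single spaces and drop empty entries, mirroring
    # Split(new[] {' '}, StringSplitOptions.RemoveEmptyEntries) in C#.
    words = [w for w in name.split(" ") if w]
    # Raises IndexError for an empty or all-space name,
    # much as the C# indexer would throw.
    return len(words[0])


if __name__ == "__main__":
    print(first_word_length("Apple Pie"))
```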
Finally, do your search as follows:
{
  "query":{
    "bool":{
      "must":{
        "match_phrase_prefix":{
          "name":{
            "query":"apple"
          }
        }
      },
      "should":{
        "function_score":{
          "query":{
            "query_string":{
              "fields":[
                "name.raw"
              ],
              "query":"apple*"
            }
          },
          "script_score":{
            "script":"100/doc['_nameFirstWordLength'].value"
          },
          "boost_mode":"replace"
        }
      }
    }
  }
}
I'm using match_phrase_prefix so that partial matches are supported, such as "ap" matching "apple". The bool must/should with that second query_string query against name.raw gives a higher score to results whose name starts with one of the search terms (in my code I'm pre-processing the search string, just for that second query, to add a "*" after every word). Finally, wrapping that second query in a function_score script that uses the value of _nameFirstWordLength causes the results up-scored by the second query to be further sorted by the length of their first word (causing Apple to show before Applebee's, for example).
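As a sketch of the pre-processing step mentioned above, the request body could be built from the raw search string roughly like this in Python. The helper names `add_wildcards` and `build_query` are made up for illustration, and the sketch does not escape characters that are special to query_string, which real input would probably need.

```python
def add_wildcards(search: str) -> str:
    """Append '*' to every whitespace-separated word of the search string."""
    return " ".join(word + "*" for word in search.split())


def build_query(search: str) -> dict:
    """Build the bool must/should request body for a search string."""
    return {
        "query": {
            "bool": {
                # Partial matches on the analyzed field, e.g. "ap" -> "apple".
                "must": {"match_phrase_prefix": {"name": {"query": search}}},
                # Extra score for names that start with a search term,
                # replaced by 100 / (length of the name's first word).
                "should": {
                    "function_score": {
                        "query": {
                            "query_string": {
                                "fields": ["name.raw"],
                                "query": add_wildcards(search),
                            }
                        },
                        "script_score": {
                            "script": "100/doc['_nameFirstWordLength'].value"
                        },
                        "boost_mode": "replace",
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    import json

    print(json.dumps(build_query("apple pie"), indent=2))
```

With that script, a first word of length 5 ("Apple") gets a replaced score of 100/5 = 20, while a first word of length 10 ("Applebee's") gets 100/10 = 10, which is why the shorter one sorts first.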
Upvotes: 1
Reputation: 52368
You can do a custom sorting, like this:
{
  "query": {
    "match": {
      "content": "donut"
    }
  },
  "sort": {
    "_script": {
      "script": "termInfo=_index['content'].get('donut',_OFFSETS);for(pos in termInfo){return _score+pos.startOffset};",
      "type": "number",
      "order": "asc"
    }
  }
}
In there I just returned the startOffset. If you need something else, play with those values and the original scoring and come up with a comfortable value for your needs.
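To see what sorting on the start offset does to the example strings from the question, here is a small client-side Python simulation. It only models the start offset of the first matching term, not _score, and it uses plain whitespace tokenization rather than a real analyzer; the function names are made up for this sketch.

```python
import re
from typing import List, Optional


def first_start_offset(text: str, term: str) -> Optional[int]:
    """Character offset of the first token equal to term, or None."""
    for match in re.finditer(r"\S+", text):
        if match.group().lower() == term.lower():
            return match.start()
    return None


def sort_by_term_offset(docs: List[str], term: str) -> List[str]:
    """Keep documents containing term, ordered by ascending start offset."""
    hits = [d for d in docs if first_start_offset(d, term) is not None]
    return sorted(hits, key=lambda d: first_start_offset(d, term))


if __name__ == "__main__":
    docs = [
        "apple banana coconut donut",
        "apple banana donut durian",
        "apple donut coconut durian",
        "donut banana coconut durian",
    ]
    for doc in sort_by_term_offset(docs, "donut"):
        print(first_start_offset(doc, "donut"), doc)
```

The offsets for "donut" in the four strings are 0, 6, 13 and 21, so ascending order reproduces the ordering asked for in the question.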
Or you can do something like this:
{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "content": "donut"
        }
      },
      "script_score": {
        "script": "termInfo=_index['content'].get('donut',_OFFSETS);for(pos in termInfo){return pos.startOffset};"
      },
      "boost_mode": "replace"
    }
  },
  "sort": [
    {
      "_score": "asc"
    }
  ]
}
In either case, the mapping for that specific field needs to include this:
"content": {
  "type": "string",
  "index_options": "offsets"
}
meaning index_options needs to be set to offsets. The Elasticsearch documentation on index_options has more details about this.
Upvotes: 6