Reputation: 14699
Let's say we issue a query (the exact type is not relevant) for the term: "cosmopolitan," on a particular field, and let's assume that the result set contains several documents which each contain exactly 'k' instances of "cosmopolitan."
By whatever mechanism is applicable (boosting, weighting, sorting, etc.), I'd like the result set returned such that the positions of "cosmopolitan" within the documents are taken into account, i.e. if the average position of "cosmopolitan" in a document is lower (closer to the start of the doc), then that document's rank/score is higher.
I've looked into different sorts of queries and scripting, but can't seem to find anything that applies to this, which seems odd, since in many domains term position can be really important.
Upvotes: 1
Views: 277
Reputation: 16905
If we're talking about exact substrings of an arbitrary myfield, we can use a sorting script which subtracts the index of the first occurrence from the whole string length, thereby boosting earlier occurrences:
{
  "query": { ... },
  "sort": [
    {
      "_script": {
        "script": {
          "params": {
            "substr_value": "cosmopolitan"
          },
          "source": """
            def fieldval = doc['myfield.keyword'].value;
            def indexof = fieldval.indexOf(params.substr_value);
            return indexof == -1 ? _score : _score + (fieldval.length() - indexof);
          """
        },
        "type": "number",
        "order": "desc"
      }
    }
  ]
}
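The .keyword sub-field is not strictly required, but the script does need access to the original, unanalyzed value of myfield in order to work. Note that enabling fielddata: true on a text field exposes the analyzed terms rather than the original string, so a keyword (sub-)field is the safer choice here.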
Alternatively, a function score query is a great fit here:
{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "myfield": "cosmopolitan"
        }
      },
      "script_score": {
        "script": {
          "params": {
            "substr_value": "cosmopolitan"
          },
          "source": """
            def fieldval = doc['myfield.keyword'].value;
            def indexof = fieldval.indexOf(params.substr_value);
            return indexof == -1 ? _score : (fieldval.length() - indexof);
          """
        }
      },
      "boost_mode": "sum"
    }
  }
}
You can tweak its parameters like boost_mode, weight, etc. to suit your needs.
Also, you'll probably want to compute some weighted average over all the substring occurrences rather than just the first one, and you can do so within those scripts.
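As a rough sketch of that idea, the script_score below loops over every occurrence and scores by the plain (unweighted) average position. It reuses myfield.keyword and params.substr_value from the queries above. The rest is my own assumption rather than anything prescribed by Elasticsearch: occurrences are counted without overlap, the score is the string length minus the average index, and documents where the exact substring is not found get 0 from the function. The triple-quoted source is Kibana console syntax, so it would need to be collapsed into a regular JSON string elsewhere.

```json
{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "myfield": "cosmopolitan"
        }
      },
      "script_score": {
        "script": {
          "params": {
            "substr_value": "cosmopolitan"
          },
          "source": """
            def fieldval = doc['myfield.keyword'].value;
            def term = params.substr_value;
            // Guard: an empty term would never advance the search index.
            if (term.length() == 0) {
              return 0;
            }
            double positionSum = 0;
            int count = 0;
            int idx = fieldval.indexOf(term);
            // Walk through all non-overlapping occurrences of the term.
            while (idx != -1) {
              positionSum += idx;
              count++;
              idx = fieldval.indexOf(term, idx + term.length());
            }
            // Assumption: no exact (case-sensitive) occurrence adds nothing.
            if (count == 0) {
              return 0;
            }
            // A lower average position yields a larger, non-negative value.
            return fieldval.length() - (positionSum / count);
          """
        }
      },
      "boost_mode": "sum"
    }
  }
}
```

With boost_mode set to sum, the value returned here is added to the match query's own score. To turn the plain average into a weighted one, multiply each idx by a weight of your choosing inside the loop and divide by the sum of the weights instead of count. Keep in mind that indexOf is case-sensitive while the match query is analyzed, so a document can match the query and still have no exact occurrence in the script.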
Upvotes: 2