Reputation: 1778
In Elasticsearch 8, I have the following index:
{
"settings": {
"analysis": {
"analyzer": {
"default": {
"type": "whitespace"
},
"default_search": {
"type": "whitespace"
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"term_vector": "with_positions"
}
}
}
}
And this query:
{
"query": {
"simple_query_string": {
"query": "banana"
}
},
"track_scores": true,
"sort": {
"_script": {
"type": "number",
"order": "desc",
"script": {
"source": " >>> HOW TO GET THE POSITION OF banana IN THE title FIELD? <<< ",
"lang": "painless"
}
}
}
}
I found very old answers, such as:
_index['title'].get('banana',_POSITIONS);
But that fails with:
cannot resolve symbol [_index]
I need this because I want a higher score for documents where the query term appears earlier in the title field.
Upvotes: 0
Views: 115
Reputation: 116
Use a script like the one below:
{
"query": {
"simple_query_string": {
"query": "banana"
}
},
"track_scores": true,
"sort": {
"_script": {
"type": "number",
"order": "desc",
"script": {
"source": """
int position = -1;
for (int i = 0; i < doc['title'].size(); i++) {
if (doc['title'][i] == 'banana') {
position = i;
break;
}
}
return position;
""",
"lang": "painless"
}
}
}
}
Upvotes: 0
Reputation: 496
As far as I could find, Elasticsearch has no tool for getting a term's position (except the Analysis Predicate Context).
So my solution is to tokenize the text a second time inside a script, in parallel to the analyzer.
Sample documents
PUT /term_position_score/_bulk
{"create":{"_id":1}}
{"text": "apple jackfruit banana"}
{"create":{"_id":2}}
{"text": "apple banana apple"}
{"create":{"_id":3}}
{"text": "banana apple apple"}
{"create":{"_id":4}}
{"text": "nobananas apple apple"}
{"create":{"_id":5}}
{"text": "banana"}
{"create":{"_id":6}}
{"text": "banana apple"}
{"create":{"_id":7}}
{"text": "apple banana"}
{"create":{"_id":8}}
{"text": "apple jackfruit apple jackfruit apple jackfruit apple banana"}
Query with function_score and script
GET /term_position_score/_search?filter_path=hits.hits
{
"query": {
"function_score": {
"query": {
"match_all": {}
},
"script_score": {
"script": {
"source": """
int getItemPosition(def array, def item) {
int arrayLength = array.length;
for (int i = 0; i < arrayLength; i++) {
if (item == array[i]) {
return i;
}
}
return -1;
}
String term = params['query'];
String[] terms = /\s/.split(params['_source']['text']);
int termCount = terms.length;
int termPosition = getItemPosition(terms, term);
double correctedScore = (termPosition == -1) ? 0 : (5 - Math.log(termPosition + termCount * 2));
return correctedScore;
""",
"params": {
"query": "banana"
}
}
}
}
}
}
Response
{
"hits" : {
"hits" : [
{
"_index" : "term_position_score",
"_type" : "_doc",
"_id" : "5",
"_score" : 4.306853,
"_source" : {
"text" : "banana"
}
},
{
"_index" : "term_position_score",
"_type" : "_doc",
"_id" : "6",
"_score" : 3.6137056,
"_source" : {
"text" : "banana apple"
}
},
{
"_index" : "term_position_score",
"_type" : "_doc",
"_id" : "7",
"_score" : 3.390562,
"_source" : {
"text" : "apple banana"
}
},
{
"_index" : "term_position_score",
"_type" : "_doc",
"_id" : "3",
"_score" : 3.2082405,
"_source" : {
"text" : "banana apple apple"
}
},
{
"_index" : "term_position_score",
"_type" : "_doc",
"_id" : "2",
"_score" : 3.0540898,
"_source" : {
"text" : "apple banana apple"
}
},
{
"_index" : "term_position_score",
"_type" : "_doc",
"_id" : "1",
"_score" : 2.9205585,
"_source" : {
"text" : "apple jackfruit banana"
}
},
{
"_index" : "term_position_score",
"_type" : "_doc",
"_id" : "8",
"_score" : 1.8645058,
"_source" : {
"text" : "apple jackfruit apple jackfruit apple jackfruit apple banana"
}
},
{
"_index" : "term_position_score",
"_type" : "_doc",
"_id" : "4",
"_score" : 0.0,
"_source" : {
"text" : "nobananas apple apple"
}
}
]
}
}
I've designed the score function so that a text with fewer terms and an earlier position of the found term gets a higher score.
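To check the scoring rule outside Elasticsearch, here is a minimal Python sketch of the same logic: split on single whitespace characters, find the first exact match, then apply 5 - ln(position + 2 * termCount). The function names term_position and position_score are made up for this illustration, and Python's re.split may treat leading or trailing whitespace differently from the Java split that Painless uses, so treat it as an approximation of the script rather than a replacement for it.

```python
import math
import re


def term_position(tokens: list[str], term: str) -> int:
    """Return the index of the first token equal to `term`, or -1 if absent."""
    for i, token in enumerate(tokens):
        if token == term:
            return i
    return -1


def position_score(text: str, term: str) -> float:
    """Mirror the Painless script: 0 when the term is missing, otherwise
    5 - ln(position + 2 * token_count)."""
    tokens = re.split(r"\s", text)  # same single-character split as /\s/
    position = term_position(tokens, term)
    if position == -1:
        return 0.0
    return 5 - math.log(position + 2 * len(tokens))


if __name__ == "__main__":
    # A few of the sample documents from the bulk request above.
    samples = [
        "apple jackfruit banana",
        "banana apple apple",
        "nobananas apple apple",
        "banana",
    ]
    ranked = sorted(samples, key=lambda t: position_score(t, "banana"), reverse=True)
    for text in ranked:
        print(f"{position_score(text, 'banana'):.6f}  {text}")
```

One thing the sketch makes easy to see: the formula drops below zero once position + 2 * termCount exceeds e^5 (about 148), and as far as I know Elasticsearch rejects negative script scores, so long fields may need a different constant or a clamp.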
Limitations
The /\s/ regex splitter. If you need another tokenizer, you must change the regex.
The params of the script. You can employ a search template in this use case.
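The search template idea could look roughly like the sketch below, which only builds the two request bodies in Python and prints them as JSON. The template id term_position_template and the Mustache variable query_term are hypothetical names chosen for this example. The first body is meant for the stored script endpoint (PUT _scripts/term_position_template), the second for the search template endpoint (GET term_position_score/_search/template). I have not run this against a cluster, so take it as an assumption-laden starting point.

```python
import json

TEMPLATE_ID = "term_position_template"  # hypothetical template id

# The Painless script from the answer above, unchanged in logic.
PAINLESS_SOURCE = r"""
int getItemPosition(def array, def item) {
  for (int i = 0; i < array.length; i++) {
    if (item == array[i]) {
      return i;
    }
  }
  return -1;
}
String[] terms = /\s/.split(params['_source']['text']);
int termPosition = getItemPosition(terms, params['query']);
return (termPosition == -1) ? 0 : (5 - Math.log(termPosition + terms.length * 2));
"""


def build_template_body() -> dict:
    """Body for storing the search template; {{query_term}} is filled in per search."""
    return {
        "script": {
            "lang": "mustache",
            "source": {
                "query": {
                    "function_score": {
                        "query": {"match_all": {}},
                        "script_score": {
                            "script": {
                                "source": PAINLESS_SOURCE,
                                "params": {"query": "{{query_term}}"},
                            }
                        },
                    }
                }
            },
        }
    }


def build_search_body(term: str) -> dict:
    """Body for running the stored template with a concrete search term."""
    return {"id": TEMPLATE_ID, "params": {"query_term": term}}


if __name__ == "__main__":
    print(json.dumps(build_template_body(), indent=2))
    print(json.dumps(build_search_body("banana"), indent=2))
```

With this split, the caller passes the search term once as a template parameter instead of editing the script params by hand for every query.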
Upvotes: 0