Skacc

Reputation: 1778

Elasticsearch: how to get term position in sort script

In Elasticsearch 8, I have the following index:

{
    "settings": {
        "analysis": {
            "analyzer": {
                "default": {
                    "type": "whitespace"
                },
                "default_search": {
                    "type": "whitespace"
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "term_vector": "with_positions"
            }
        }
    }
}

And this query:

{
    "query": {
        "simple_query_string": {
            "query": "banana"
        }
    },
    "track_scores": true,
    "sort": {
        "_script": {
            "type": "number",
            "order": "desc",
            "script": {
                "source": " >>> HOW TO GET THE POSITION OF banana IN THE title FIELD? <<< ",
                "lang": "painless"
            }
        }
    }
}

I found very old answers that suggest something like:
_index['title'].get('banana',_POSITIONS);
But that fails with the error:
cannot resolve symbol [_index]

I need this because I want a higher score for documents where the query term appears earlier in the title field.

Upvotes: 0

Views: 115

Answers (2)

Divyarajsinh Barad

Reputation: 116

Use a script like the one below:

{
    "query": {
        "simple_query_string": {
            "query": "banana"
        }
    },
    "track_scores": true,
    "sort": {
        "_script": {
            "type": "number",
            "order": "desc",
            "script": {
                "source": """
                    int position = -1;
                    for (int i = 0; i < doc['title'].size(); i++) {
                        if (doc['title'][i] == 'banana') {
                            position = i;
                            break;
                        }
                    }
                    return position;
                """,
                "lang": "painless"
            }
        }
    }
}

Upvotes: 0

G0l0s

Reputation: 496

As far as I could find, Elasticsearch has no tool for getting a term's position inside a script (except in the Analysis Predicate Context).

So my solution is to tokenize the text a second time inside the script, in parallel with the analyzer.

Sample documents

PUT /term_position_score/_bulk
{"create":{"_id":1}}
{"text": "apple jackfruit banana"}
{"create":{"_id":2}}
{"text": "apple banana apple"}
{"create":{"_id":3}}
{"text": "banana apple apple"}
{"create":{"_id":4}}
{"text": "nobananas apple apple"}
{"create":{"_id":5}}
{"text": "banana"}
{"create":{"_id":6}}
{"text": "banana apple"}
{"create":{"_id":7}}
{"text": "apple banana"}
{"create":{"_id":8}}
{"text": "apple jackfruit apple jackfruit apple jackfruit apple banana"}

Query with function_score and script

GET /term_position_score/_search?filter_path=hits.hits
{
    "query": {
        "function_score": {
            "query": {
                "match_all": {}
            },
            "script_score": {
                "script": {
                    "source": """
                        int getItemPosition(def array, def item) {
                            int arrayLength = array.length;
                            for (int i = 0; i < arrayLength; i++) {
                                if (item == array[i]) {
                                    return i;
                                }
                            }
                            return -1;
                        }

                        String term = params['query'];
                        String[] terms = /\s/.split(params['_source']['text']);
                        int termCount = terms.length;
                        int termPosition = getItemPosition(terms, term);
                        
                        double correctedScore = (termPosition == -1) ? 0 : (5 - Math.log(termPosition + termCount * 2));
                        return correctedScore;
                    """,
                    "params": {
                        "query": "banana"
                    }
                }
            }
        }
    }
}

Response

{
    "hits" : {
        "hits" : [
            {
                "_index" : "term_position_score",
                "_type" : "_doc",
                "_id" : "5",
                "_score" : 4.306853,
                "_source" : {
                    "text" : "banana"
                }
            },
            {
                "_index" : "term_position_score",
                "_type" : "_doc",
                "_id" : "6",
                "_score" : 3.6137056,
                "_source" : {
                    "text" : "banana apple"
                }
            },
            {
                "_index" : "term_position_score",
                "_type" : "_doc",
                "_id" : "7",
                "_score" : 3.390562,
                "_source" : {
                    "text" : "apple banana"
                }
            },
            {
                "_index" : "term_position_score",
                "_type" : "_doc",
                "_id" : "3",
                "_score" : 3.2082405,
                "_source" : {
                    "text" : "banana apple apple"
                }
            },
            {
                "_index" : "term_position_score",
                "_type" : "_doc",
                "_id" : "2",
                "_score" : 3.0540898,
                "_source" : {
                    "text" : "apple banana apple"
                }
            },
            {
                "_index" : "term_position_score",
                "_type" : "_doc",
                "_id" : "1",
                "_score" : 2.9205585,
                "_source" : {
                    "text" : "apple jackfruit banana"
                }
            },
            {
                "_index" : "term_position_score",
                "_type" : "_doc",
                "_id" : "8",
                "_score" : 1.8645058,
                "_source" : {
                    "text" : "apple jackfruit apple jackfruit apple jackfruit apple banana"
                }
            },
            {
                "_index" : "term_position_score",
                "_type" : "_doc",
                "_id" : "4",
                "_score" : 0.0,
                "_source" : {
                    "text" : "nobananas apple apple"
                }
            }
        ]
    }
}

I designed the score function so that a document gets a higher score when it has fewer terms and when the query term appears earlier in the text.
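
To sanity-check the formula outside Elasticsearch, here is a minimal Python sketch that mirrors the script's logic. The function name `corrected_score` is hypothetical, and `str.split()` only stands in for the `/\s/` regex under the assumption that the text is single-spaced, as in the sample documents.

```python
import math


def corrected_score(text: str, term: str) -> float:
    """Mirror of the Painless script: 0 if the term is absent,
    otherwise 5 - ln(position + 2 * term_count)."""
    # Assumption: single-spaced text, so this matches the /\s/ split.
    terms = text.split()
    if term not in terms:
        return 0.0
    # Zero-based position of the first exact match.
    position = terms.index(term)
    return 5 - math.log(position + 2 * len(terms))


samples = [
    "banana",
    "banana apple",
    "apple banana",
    "nobananas apple apple",
]
for text in samples:
    print(f"{corrected_score(text, 'banana'):.4f}  {text}")
```

For the sample documents this reproduces the `_score` values in the response, for example 5 - ln(2) for the single-term document "banana" and 0 for "nobananas apple apple", which contains no exact match.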

Limitations

  • The scripted tokenizer is the /\s/ regex splitter. If you need a different tokenizer, you must change the regex
  • The query term is defined in the params of the script
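
One consequence of the first limitation is worth illustrating: a splitter that matches a single whitespace character produces empty tokens when words are separated by more than one space, which shifts the computed position. The Python sketch below shows the effect on a made-up input; Python's `re.split` is used here only as a stand-in for the Painless regex, and switching to a pattern like `\s+` is a suggestion, not part of the original script.

```python
import re

text = "apple  banana"  # hypothetical input with two spaces between the words

# Splitting on a single whitespace character leaves an empty token
# between the two spaces, so "banana" is found at index 2.
single = re.split(r"\s", text)

# Splitting on runs of whitespace drops the empty token,
# so "banana" is found at index 1.
runs = re.split(r"\s+", text.strip())

print(single, single.index("banana"))
print(runs, runs.index("banana"))
```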

You can use a search template for this use case, so that the query term is passed in as a template parameter
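
As a rough sketch of the search template idea, the Python snippet below builds the two request bodies as plain dictionaries and prints them as JSON. The template id `term_position_template` and the parameter name `query_string` are hypothetical, and `PAINLESS_SOURCE` is a placeholder for the script shown in the query above. The bodies are meant for the stored script API (`PUT _scripts/<id>`) and the search template API (`GET <index>/_search/template`); verify them against the documentation for your Elasticsearch version.

```python
import json

# Placeholder: paste the Painless script from the function_score query here.
PAINLESS_SOURCE = "<Painless script from the query above>"

# Body for: PUT _scripts/term_position_template  (hypothetical template id)
template_body = {
    "script": {
        "lang": "mustache",
        "source": {
            "query": {
                "function_score": {
                    "query": {"match_all": {}},
                    "script_score": {
                        "script": {
                            "source": PAINLESS_SOURCE,
                            # Mustache fills this in when the template runs.
                            "params": {"query": "{{query_string}}"},
                        }
                    },
                }
            }
        },
    }
}

# Body for: GET /term_position_score/_search/template
search_body = {
    "id": "term_position_template",
    "params": {"query_string": "banana"},
}

print(json.dumps(template_body, indent=2))
print(json.dumps(search_body, indent=2))
```

With this layout the Painless script stays unchanged between searches, and only the `params` of the search request vary.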

Upvotes: 0
