Zolat Eater

Reputation: 53

How to get the maximum possible score for a query in Elasticsearch?

I have a large document set stored in an Elasticsearch index, and I need to find similar documents to exclude duplicates.

Unfortunately, these documents can differ in their field values, so I cannot rely entirely on filters. Instead, I am trying to evaluate how different two documents are by querying multiple fields with different boost values.

If the difference is too big, the document doesn't count as a duplicate. The problem is that I do not know how to evaluate the difference, because _score in the search response says nothing about how big it is.

It would be perfect to have the maximum possible score for each particular query. How can I achieve this?

Edit: For example, if I execute a query like this, it returns JSON with a _score greater than 1.00:

Request: GET /documents/sometype/_search

{
    "query": {
        "bool": {
            "should": [
                {"match": {
                    "title": {
                        "query": "some title"
                    }
                }}
            ]
        }
    }
}

Example response:

{ "took": 1, "timed_out": false, "_shards": ..., "hits": { "total": 100, "max_score": 1.7588379, } }

As the documentation says, _score is just a floating-point number; it says nothing about its range.

Upvotes: 1

Views: 3551

Answers (2)

nimi

Reputation: 126

Here's a Python snippet that gets the current maximum term score by indexing a document with a non-existent term (and deleting it afterwards).

This assumes that 25 random lowercase characters have ~0 chance of already appearing as a term in your index (otherwise, change the way the unique string is generated).

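import random
import string

import elasticsearch

es = elasticsearch.Elasticsearch()

# Random 25-letter token that should not already exist as a term in the index.
unique = ''.join(random.choice(string.ascii_lowercase) for _ in range(25))

index = "your_index"
doc_type = "your_doctype"
key = "your_key"  # field whose term statistics are inspected

# Index a temporary document holding only the unique term; refresh so it is visible.
es.index(index=index, doc_type=doc_type, body={key: unique}, id=unique, params={"refresh": "true"})

# Request term vectors for an artificial document with the same field and term.
body = {
    "doc": {key: unique},
    "term_statistics": True,
    "field_statistics": True,
    "positions": False,
    "offsets": False,
    "filter": {
        "min_term_freq": 0,
        "min_doc_freq": 0
    }
}

result = es.termvectors(index=index, doc_type=doc_type, body=body)

# Keep the entry returned for the unique term, stored per index name.
max_es_term_score = {index: result["term_vectors"][key]["terms"][unique]}

# Remove the temporary document again.
es.delete(index=index, doc_type=doc_type, id=unique)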

Upvotes: 0

fwilhelm

Reputation: 358

That's an interesting question. Since the Practical Scoring Function (PSF) generally uses the inverse document frequency (IDF), the question "What is the maximum document score for a given query?" is not well-posed. The score depends on all documents, i.e. on the index, and even on the number of shards in your ES configuration.

My guess would be that by modifying the index it is possible to show that the maximum score of a query is unbounded if IDF is used.
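To make the guess above a bit more concrete, here is a minimal sketch of the IDF term alone, assuming the formula used by Lucene's classic TF-IDF similarity, 1 + ln(numDocs / (docFreq + 1)). The function name classic_idf is made up for illustration, and other similarities (e.g. BM25) use a different IDF formula, so treat this as an illustration rather than what your cluster necessarily computes.

```python
import math


def classic_idf(num_docs: int, doc_freq: int) -> float:
    """IDF as in Lucene's classic TF-IDF similarity (assumed here):
    1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


if __name__ == "__main__":
    # A term that occurs in a single document keeps getting a larger IDF
    # as more unrelated documents are added to the index.
    for num_docs in (10**3, 10**6, 10**9):
        print(num_docs, classic_idf(num_docs, doc_freq=1))
```

Because the logarithm grows without bound as numDocs grows while docFreq stays fixed, the IDF factor for a rare term has no upper limit under this formula.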

In special cases though, if you deactivate the IDF part of the PSF by using e.g. constant_score, the maximum score should be bounded since it only depends on the doc itself, not the index.
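The constant_score case could be sketched roughly as follows. This is only an illustration under assumptions: the helper build_bounded_query and the "author" field are hypothetical, and the computed bound assumes that the bool query simply sums the scores of matching should clauses (no coord factor or query normalization, which some older versions apply).

```python
def build_bounded_query(clauses):
    """Build a bool/should query of constant_score clauses.

    clauses: iterable of (field, text, boost) tuples.
    Returns (query_body, max_score), where max_score is the sum of all
    boosts, i.e. the score of a document matching every clause
    (assuming should-clause scores are simply added up).
    """
    clauses = list(clauses)
    should = [
        {"constant_score": {"filter": {"match": {field: text}}, "boost": boost}}
        for field, text, boost in clauses
    ]
    max_score = sum(boost for _, _, boost in clauses)
    return {"query": {"bool": {"should": should}}}, max_score


if __name__ == "__main__":
    body, max_score = build_bounded_query([
        ("title", "some title", 2.0),
        ("author", "some author", 1.0),  # hypothetical second field
    ])
    print(body)
    print(max_score)
```

Under these assumptions, dividing a hit's _score by max_score gives a value between 0 and 1 that can be compared against a duplicate threshold.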

That being said, I would also love to see a _max_score endpoint that returns inf if IDF is used anywhere in the query, and the actual maximum document score otherwise.

Upvotes: 2
