andy

Reputation: 2993

Common IDF scoring across all multi_match query fields with Elasticsearch

With the following document set:

curl -XPUT "http://localhost:9200/test/books/1" -d '{
  "title": "Bacon Dishes",
  "tags": ["bacon", "cooking"]
}'

curl -XPUT "http://localhost:9200/test/books/2" -d '{
  "title": "Beyond Bacon",
  "tags" : ["cooking"]
}'

And the following query:

curl -XGET "http://localhost:9200/test/books/_search?pretty=true&search_type=dfs_query_then_fetch" -d '{
  "explain" : true,
  "query" : {
    "multi_match" : {
      "query" : "bacon beyond",
      "fields" : ["title^2","tags^1"]
    }
  }
}'

The explain output shows that the score for title is calculated using idf(docFreq=2, maxDocs=2), while the score for tags (when that field matches) is calculated using idf(docFreq=1, maxDocs=2).

This becomes a problem (at least for us) when, say, there are 100 books of which 50 have "bacon" in the title and only 1 has "bacon" in the tags but not in the title. With the query above, the document that has "bacon" only in the tags is scored higher, even though title is boosted.
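To make the gap concrete, here is a rough sketch of the arithmetic. It assumes Lucene's classic TF/IDF similarity (the Elasticsearch default before 5.0), where idf = 1 + ln(maxDocs / (docFreq + 1)). The `idf` shell helper is hypothetical and only reproduces that formula; it does not talk to Elasticsearch.

```shell
# Hypothetical helper: classic Lucene idf = 1 + ln(maxDocs / (docFreq + 1)).
# Usage: idf <docFreq> <maxDocs>
idf() {
  awk -v df="$1" -v n="$2" 'BEGIN { printf "%.4f\n", 1 + log(n / (df + 1)) }'
}

# Two-document example from the explain output
idf 2 2      # title: "bacon" occurs in both documents -> 0.5945
idf 1 2      # tags:  "bacon" occurs in one document   -> 1.0000

# 100-book scenario
idf 50 100   # title: 50 of 100 documents -> 1.6733
idf 1 100    # tags:  1 of 100 documents  -> 4.9120
```

In the classic formula idf enters the score squared, so, ignoring field norms and coord, the tags match contributes roughly 24.1 against roughly 2.8 for title. A boost of 2 on title does not close that gap.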

I would like the score calculations for both the tags and title fields in the first example to use:

 idf(docFreq=2, maxDocs=2)

That is, I would like the score calculation to use the docFreq of a term across all fields in the multi_match query. Is this possible?

Upvotes: 1

Views: 326

Answers (1)

javanna

Reputation: 60245

I would just increase the boost on the title field, enough to make it outweigh the tags field.
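As a sketch, this is the query from the question with a larger title boost. The value 10 is an arbitrary starting point chosen for illustration, not a recommendation; check the explain output against your own data to see whether it is enough.

```shell
# Same query as in the question; only the title boost differs.
# title^10 is an assumed value and needs tuning against a real index.
curl -XGET "http://localhost:9200/test/books/_search?pretty=true&search_type=dfs_query_then_fetch" -d '{
  "explain" : true,
  "query" : {
    "multi_match" : {
      "query" : "bacon beyond",
      "fields" : ["title^10","tags^1"]
    }
  }
}'
```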

I don't think you want to implement your own custom similarity and plug it into Elasticsearch.

Just keep in mind that weighting fields and experimenting with boosts is fine-tuning, which needs to be done against a real index with real data and real queries.

Upvotes: 1
