Alkis Kalogeris

Reputation: 17773

Elasticsearch wrong explanation validate api

I'm using Elasticsearch 5.2. I'm executing the query below against an index that contains only one document.

Query:

GET test/val/_validate/query?pretty&explain=true
{
  "query": {
    "bool": {
      "should": {
        "multi_match": {
          "query": "alkis stackoverflow",
          "fields": [
            "name",
            "job"
          ],
          "type": "most_fields",
          "operator": "AND"
        }
      }
    }
  }
}

Document:

PUT test/val/1
{
  "name": "alkis stackoverflow",
  "job": "developer"
}

The explanation of the query is

+(((+job:alkis +job:stackoverflow) (+name:alkis +name:stackoverflow))) #(#_type:val)

I read this as: field job must have alkis and stackoverflow AND field name must have alkis and stackoverflow.

This is not the case with my document though. Judging from the result I'm getting, the AND between the two fields actually behaves as an OR.

When I change the type to best_fields I get

+(((+job:alkis +job:stackoverflow) | (+name:alkis +name:stackoverflow))) #(#_type:val)

Which is the correct explanation.

Is there a bug with the validate api? Have I misunderstood something? Isn't the scoring the only difference between these two types?

Upvotes: 1

Views: 55

Answers (1)

Val

Reputation: 217564

Because you picked the most_fields type with an explicit AND operator, one match query is generated per field, and all terms must be present in a single field for a document to match. That is the case for your document: both terms, alkis and stackoverflow, are present in the name field, which is why it matches.

So in the explanation of the corresponding Lucene query, i.e.

+(((+job:alkis +job:stackoverflow) (+name:alkis +name:stackoverflow))) 

the two parenthesized groups have no operator between them, and in that case the default is OR.

So you need to read this as: Field job must have both alkis and stackoverflow OR field name must have both alkis and stackoverflow.

The AND operator that you apply only concerns the terms of your query within a single field; it is not an AND between the fields. Said differently, your query is executed as two match queries (one per field, each using the and operator) in a bool/should clause, like this:

{
  "query": {
    "bool": {
      "should": [
        { "match": { "job":  { "query": "alkis stackoverflow", "operator": "and" }}},
        { "match": { "name": { "query": "alkis stackoverflow", "operator": "and" }}}
      ]
    }
  }
}

In summary, the most_fields type is most useful when querying multiple fields that contain the same text analyzed in different ways. This is not your case, so you would probably be better off using cross_fields or best_fields, depending on your use case, but certainly not most_fields.
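For illustration only, here is a minimal sketch of what the cross_fields variant could look like, assuming the same index, fields, and search text as in the question. It is the body of a search request (for example against test/val/_search). With cross_fields and the and operator, each term has to appear in at least one of the listed fields, rather than all terms in the same field. Whether that is the behavior you want depends on your use case.

```json
{
  "query": {
    "multi_match": {
      "query": "alkis stackoverflow",
      "fields": ["name", "job"],
      "type": "cross_fields",
      "operator": "and"
    }
  }
}
```

Note that cross_fields works best when the fields share the same analyzer, so treat this as a starting point rather than a drop-in replacement.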

UPDATE

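When using the best_fields type, ES generates a dis_max query instead of a bool/should. The | sign (which is not an OR!) separates the sub-queries of that dis_max query. A document still matches if any sub-query matches, but its score is taken from the best-matching sub-query instead of being summed over all of them.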
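As a rough sketch, the best_fields form of your query should correspond to something like the dis_max request body below. This assumes one match query per field with the and operator and the default tie_breaker, so it is an approximation of what ES builds internally, not its exact rewrite.

```json
{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "job":  { "query": "alkis stackoverflow", "operator": "and" }}},
        { "match": { "name": { "query": "alkis stackoverflow", "operator": "and" }}}
      ]
    }
  }
}
```

Running both forms against your single document should return the same hit; the difference only shows up in how the scores of documents matching in several fields are combined.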

Upvotes: 1
