billc

Reputation: 1911

Proper way to filter a query with Elasticsearch? (filter vs filtered query)

I am trying to work out whether there is a difference between "filters" and "filtered queries" in Elasticsearch.

The two example requests below return the same results when run against my index.

Are they actually different in some subtle way?

Is there a reason why one would be preferred over the other, in different situations?

DSL giving one top-level query, and one top-level filter:

GET /index/type/_search?_source
{
  "query": {
    "multi_match": {
      "query": "my dog has fleas",
      "fields": ["name", "keywords"]
    }
  },
  "filter": {
    "term": {"status": 2}
  }
}

DSL giving only a top-level query, using the filtered construct:

GET /index/type/_search?_source
{
  "query": {
    "filtered": {
      "query": {
        "multi_match": {
          "query": "my dog has fleas",
          "fields": ["name", "keywords"]
        }
      },
      "filter": {
        "term": {"status": 2}
      }
    }
  }
}

Upvotes: 7

Views: 11645

Answers (2)

Radu Gheorghe

Reputation: 1124

Later versions of Elasticsearch have a filter clause in the bool query. Using it does not necessarily run the filter before the query: Elasticsearch rewrites and optimizes the overall query as it sees fit, and there is no real control over the execution order on the user's end.
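As a rough sketch of the bool form, here is the request from the question rewritten with a filter clause. It assumes a version where the bool query accepts filter (2.0 or later, as far as I know) and reuses the question's field names and values; it would be sent as the body of the same _search request.

```json
{
  "query": {
    "bool": {
      "must": {
        "multi_match": {
          "query": "my dog has fleas",
          "fields": ["name", "keywords"]
        }
      },
      "filter": {
        "term": {"status": 2}
      }
    }
  }
}
```

The must clause contributes to the score, while the filter clause only includes or excludes documents.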

The only way to control the order is to use post_filter, which runs only on the results of the query. Performance-wise, this only pays off if the filter is very expensive and the query is cheap. It is also useful if you don't want the filter to influence aggregations, since aggregations run on the results of the query, before post_filter is applied. Some e-commerce searches use this to, e.g., show only in-stock products in the hits when you select that option, while still showing both in-stock and out-of-stock products in the aggregations.
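A minimal sketch of that aggregation case, reusing the question's status field as a stand-in for a stock flag. The aggregation name by_status is made up for illustration, and the body would be sent to the same _search endpoint as in the question.

```json
{
  "query": {
    "multi_match": {
      "query": "my dog has fleas",
      "fields": ["name", "keywords"]
    }
  },
  "aggs": {
    "by_status": {
      "terms": {"field": "status"}
    }
  },
  "post_filter": {
    "term": {"status": 2}
  }
}
```

With a request like this, the hits should contain only documents with status 2, while the by_status buckets still count every status value that matched the query.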

If you need more info on Elasticsearch query-building and/or performance, feel free to check out our Elasticsearch training (disclaimer: I'm one of the instructors).

Upvotes: -1

Chris Heald

Reputation: 62648

The first example is a post_filter, which is sub-optimal from a performance perspective. Filtered queries are preferred, since the filters are run before the queries. Typically, you want your filters to run first, since scoring documents is more expensive than a boolean pass/fail check. That way, the candidate set is cut down before the query runs on it. With a post_filter, the query runs first, the entire result set is scored, and only then is the filter applied to the results.

The top-level filter directive was deprecated in 1.0 and renamed to post_filter to clarify its purpose and usage. From the documentation:

"the top-level filter parameter in search has been renamed to post_filter, to indicate that it should not be used as the primary way to filter search results (use a filtered query instead), but only to filter results AFTER facets/aggregations have been calculated."

http://www.elastic.co/guide/en/elasticsearch/reference/current/_search_requests.html

Upvotes: 18
