Sum NL

Reputation: 1105

ElasticSearch: how min_doc_count affects performance?

Why does this aggregation:

    "aggs": {
        "Condition": {
            "terms": {
                "field": "color",
                "size": 10,
                "min_doc_count": 1
            }
        }
    }

run drastically faster than this one:

    "aggs": {
        "Condition": {
            "terms": {
                "field": "color",
                "size": 10,
                "min_doc_count": 0
            }
        }
    }

Even though both return the same aggregation result for me?

Upvotes: 1

Views: 4256

Answers (2)

keety

Reputation: 17441

To add to @moliware's answer, here is the documentation excerpt again:

Setting min_doc_count=0 will also return buckets for terms that didn’t match any hit. However, some of the returned terms which have a document count of zero might only belong to deleted documents or documents from other types, so there is no warranty that a match_all query would find a positive document count for those terms.

Besides deleted documents, another significant caveat of min_doc_count=0 is that the returned terms are not restricted to documents that match the parent query, nor to the types being searched.

See the example below:

1) Create test index

PUT test

2) Insert documents of type1 and type3

POST _bulk 
{"index":{"_index":"test","_type":"type1","_id":"1"}}
{"condition":"good"}
{"index":{"_index":"test","_type":"type1","_id":"2"}}
{"condition":"bad"}
{"index":{"_index":"test","_type":"type1","_id":"3"}}
{"condition":"soso"}
{"index":{"_index":"test","_type":"type1","_id":"4"}}
{"condition":"excellent"}
{"index":{"_index":"test","_type":"type1","_id":"5"}}
{"condition":"bad"}
{"index":{"_index":"test","_type":"type1","_id":"6"}}
{"condition":"bad"}
{"index":{"_index":"test","_type":"type1","_id":"7"}}
{"condition":"excellent"}
{"index":{"_index":"test","_type":"type3","_id":"1"}}
{"condition":"unwell"}

3) Query all documents of type1 that do not contain the term bad:

POST test/type1/_search
{
   "query": {
      "bool": {
         "must_not": {
            "term": {
               "condition": "bad"
            }
         }
      }
   },
   "aggs": {
      "condition_value": {
         "terms": {
            "field": "condition",
            "size": 10,
            "min_doc_count": 0
         }
      }
   }
}

Response:

  "aggregations": {
      "condition_value": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "excellent",
               "doc_count": 2
            },
            {
               "key": "good",
               "doc_count": 1
            },
            {
               "key": "soso",
               "doc_count": 1
            },
            {
               "key": "bad",
               "doc_count": 0
            },
            {
               "key": "unwell",
               "doc_count": 0
            }
         ]
      }
   }

Note the buckets for condition:bad (excluded by the query) and condition:unwell (which exists only in type3) in the results, both with a doc_count of 0. Since terms aggregations are ordered by doc_count by default and the OP has size:10, min_doc_count:0 may appear not to affect the overall result; setting size:0 would give a better picture. In short, the number of terms used to generate the aggregation is significantly larger with min_doc_count:0.
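To make the bucket selection concrete, here is a rough Python sketch that mimics the example above. It is only a toy model, not how Elasticsearch is implemented: the function name terms_agg and its parameters are made up for illustration, and the tie-break by key is an assumption chosen to match the response shown.

```python
from collections import Counter

# The eight documents from step 2, as (type, condition) pairs.
DOCS = [
    ("type1", "good"), ("type1", "bad"), ("type1", "soso"),
    ("type1", "excellent"), ("type1", "bad"), ("type1", "bad"),
    ("type1", "excellent"), ("type3", "unwell"),
]

def terms_agg(docs, doc_type, must_not, min_doc_count, size=10):
    """Toy model of a terms aggregation; NOT Elasticsearch's real code."""
    # Count terms only over documents that match the type and the query.
    counts = Counter(c for t, c in docs if t == doc_type and c != must_not)
    if min_doc_count == 0:
        # Every term in the index becomes a candidate bucket,
        # regardless of the query or the type being searched.
        for _, c in docs:
            counts.setdefault(c, 0)
    buckets = [(k, n) for k, n in counts.items() if n >= min_doc_count]
    # Order by doc_count descending; ties broken by key (an assumption).
    buckets.sort(key=lambda kv: (-kv[1], kv[0]))
    return buckets[:size]

print(terms_agg(DOCS, "type1", "bad", min_doc_count=1))
print(terms_agg(DOCS, "type1", "bad", min_doc_count=0))
```

With min_doc_count=1 only the three matching terms are considered; with min_doc_count=0 the candidate set grows to every term in the field, which on a real index with many distinct terms is where the extra work would come from.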

Upvotes: 2

moliware

Reputation: 10278

Extracted from the documentation:

Setting min_doc_count=0 will also return buckets for terms that didn’t match any hit. However, some of the returned terms which have a document count of zero might only belong to deleted documents or documents from other types, so there is no warranty that a match_all query would find a positive document count for those terms.

So it seems that if you have lots of deleted documents, performance will be worse because the aggregation has to process more documents. Try optimizing the index (which expunges deleted documents) to see whether the performance becomes similar.
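As a rough illustration of why deleted documents could matter, here is a toy Python model. It assumes, in line with the documentation excerpt, that a delete only marks a document and that its terms linger until the index is rewritten; ToySegment and its methods are hypothetical names, not Elasticsearch or Lucene APIs.

```python
class ToySegment:
    """Toy stand-in for an index segment; NOT Lucene's real data structure."""

    def __init__(self, docs):
        self.docs = dict(docs)                    # doc id -> term
        self.live = set(self.docs)                # ids not marked as deleted
        self.term_dict = set(self.docs.values())  # built once at write time

    def delete(self, doc_id):
        # A delete only marks the document; the term dictionary is untouched.
        self.live.discard(doc_id)

    def zero_count_terms(self):
        # Terms a min_doc_count=0 aggregation would still have to emit.
        live_terms = {self.docs[i] for i in self.live}
        return sorted(self.term_dict - live_terms)

    def optimized(self):
        # Rewriting the segment keeps only live documents and their terms.
        return ToySegment((i, self.docs[i]) for i in self.live)

seg = ToySegment({1: "good", 2: "bad", 3: "bad"})
seg.delete(2)
seg.delete(3)
print(seg.zero_count_terms())              # terms left behind by deletes
print(seg.optimized().zero_count_terms())  # after the rewrite
```

In this model the term bad survives both deletes and disappears only after the rewrite, which is the effect an optimize would be expected to have on the real index.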

Upvotes: 2
