Danielzt

Reputation: 503

Elasticsearch deep aggregation doc_count doesn't match

I did several aggregations to SUM some values on our installation of ES 1.7.2.

I found out the hard way that, in some seemingly random cases, the doc_count of a bucket doesn't match the sum of the doc_count values of the buckets nested one level below it.

"key": 503,
"doc_count": 383778,
"regionid": {...}

So doc_count=383778

If I sum the doc_count of every regionid bucket in the list below, I get 383718, which is 60 fewer.

 "key": 503,
 "doc_count": 383778,
 "regionid": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
       {
          "key": 1,
          "doc_count": 303821,
          "ProviderId": {...}
       },
       {
          "key": 27,
          "doc_count": 23834,
          "ProviderId": {...}
       },
       {
          "key": 25,
          "doc_count": 9565,
          "ProviderId": {...}
       },
       {
          "key": 36,
          "doc_count": 8857,
          "ProviderId": {...}
       },
       {
          "key": 14,
          "doc_count": 8222,
          "ProviderId": {...}
       },
       {
          "key": 68,
          "doc_count": 6746,
          "ProviderId": {...}
       },
       {
          "key": 19,
          "doc_count": 4574,
          "ProviderId": {...}
       },
       {
          "key": 28,
          "doc_count": 4164,
          "ProviderId": {...}
       },
       {
          "key": 10,
          "doc_count": 3006,
          "ProviderId": {...}
       },
       {
          "key": 31,
          "doc_count": 2020,
          "ProviderId": {...}
       },
       {
          "key": 21,
          "doc_count": 1410,
          "ProviderId": {...}
       },
       {
          "key": 32,
          "doc_count": 1368,
          "ProviderId": {...}
       },
       {
          "key": 22,
          "doc_count": 1367,
          "ProviderId": {...}
       },
       {
          "key": 8,
          "doc_count": 1010,
          "ProviderId": {...}
       },
       {
          "key": 16,
          "doc_count": 825,
          "ProviderId": {...}
       },
       {
          "key": 35,
          "doc_count": 559,
          "ProviderId": {...}
       },
       {
          "key": 34,
          "doc_count": 517,
          "ProviderId": {...}
       },
       {
          "key": 26,
          "doc_count": 414,
          "ProviderId": {...}
       },
       {
          "key": 18,
          "doc_count": 371,
          "ProviderId": {...}
       },
       {
          "key": 15,
          "doc_count": 362,
          "ProviderId": {...}
       },
       {
          "key": 33,
          "doc_count": 185,
          "ProviderId": {...}
       },
       {
          "key": 9,
          "doc_count": 143,
          "ProviderId": {...}
       },
       {
          "key": 29,
          "doc_count": 102,
          "ProviderId": {...}
       },
       {
          "key": 17,
          "doc_count": 100,
          "ProviderId": {...}
       },
       {
          "key": 30,
          "doc_count": 96,
          "ProviderId": {...}
       },
       {
          "key": 20,
          "doc_count": 80,
          "ProviderId": {...}
       }
    ]
 }
},
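
The gap can be checked with a short Python sketch. The helper name `doc_count_gap` is made up for illustration, and the bucket below holds only the doc_count values copied from the listing above (keys and the nested ProviderId aggregations are left out).

```python
def doc_count_gap(bucket, child_agg):
    """Return the parent doc_count minus the sum of its child buckets' doc_count."""
    child_total = sum(b["doc_count"] for b in bucket[child_agg]["buckets"])
    return bucket["doc_count"] - child_total


# doc_count values copied from the regionid buckets in the response above
region_counts = [
    303821, 23834, 9565, 8857, 8222, 6746, 4574, 4164, 3006, 2020,
    1410, 1368, 1367, 1010, 825, 559, 517, 414, 371, 362,
    185, 143, 102, 100, 96, 80,
]

# Trimmed-down copy of the CustomerId bucket with key 503
customer_bucket = {
    "key": 503,
    "doc_count": 383778,
    "regionid": {"buckets": [{"doc_count": c} for c in region_counts]},
}

gap = doc_count_gap(customer_bucket, "regionid")
print(gap)  # 383778 - 383718 = 60
```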

Do you guys know why this is happening?

Maybe a bug?

Part of my aggregation:

 {
    "aggs": {
       "Provider": {
          "terms": {
             "field": "Provider"
          },
          "aggs": {
             "Gateway": {
                "terms": {
                   "field": "Gateway"
                },
                "aggs": {
                   "CustomerId": {
                      "terms": {
                         "field": "CustomerId"
                      },
                      "aggs": {
                         "regionid": {
                            "terms": {
                               "field": "regionid"

Any help is appreciated. Thanks

Upvotes: 0

Views: 1204

Answers (1)

jhilden

Reputation: 12449

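Terms aggregations in ES are not always exact. Each shard returns only its own top terms, and those partial lists are merged, so the counts are estimates based on how many terms each shard reports. Given a big enough shard size, the numbers can be exact, but that has significant performance implications.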

You can read more about this in the ES documentation on shard_size for the terms aggregation.
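
To illustrate why a small shard size gives approximate counts, here is a toy Python model of per-shard top-N merging. It is a simplified sketch, not Elasticsearch's actual code: the function name `merged_top_terms`, the three made-up shards, and their term counts are all assumptions for the example.

```python
from collections import Counter


def merged_top_terms(shards, size, shard_size):
    """Merge each shard's top `shard_size` terms and return the top `size` overall."""
    merged = Counter()
    for shard in shards:
        # Each shard reports only its own most frequent terms.
        for term, count in Counter(shard).most_common(shard_size):
            merged[term] += count
    return merged.most_common(size)


# Hypothetical per-shard term counts (true totals: x=20, y=16, z=13)
shards = [
    {"x": 10, "y": 8, "z": 1},
    {"x": 9, "z": 7, "y": 2},
    {"y": 6, "z": 5, "x": 1},
]

approx = merged_top_terms(shards, size=2, shard_size=2)
exact = merged_top_terms(shards, size=2, shard_size=3)

print(approx)  # [('x', 19), ('y', 14)]: counts are too low
print(exact)   # [('x', 20), ('y', 16)]: every shard reported every term
```

In this toy data the undercount comes from terms that fell just outside a shard's top list, which is the situation a larger shard size is meant to cover.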

The flatter your index (meaning the more buckets the aggregation returns), the more you need to increase shard_size. We found that for a flat index in our system a 20x multiplier was a good rule of thumb. So if we're returning the top 10 records for an aggregation, we use a shard_size of 200.
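
As a rough sketch of that rule of thumb, the terms aggregation body can be built in Python with `shard_size` derived from `size`. The helper name `terms_agg` and its `multiplier` argument are hypothetical; `size` and `shard_size` are the terms aggregation parameters, and the 20x factor is only the value that worked for us, not a general recommendation.

```python
import json


def terms_agg(field, size=10, multiplier=20):
    """Build a terms aggregation body with shard_size = size * multiplier."""
    return {
        "terms": {
            "field": field,
            "size": size,                     # buckets returned to the client
            "shard_size": size * multiplier,  # buckets requested from each shard
        }
    }


# Example: top 10 regionid buckets, asking each shard for 200
body = {"aggs": {"regionid": terms_agg("regionid", size=10)}}
print(json.dumps(body, indent=2))
```

The same helper can be reused at every nesting level, since each nested terms aggregation takes its own `size` and `shard_size`.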

Upvotes: 3
