Reputation: 169

Getting the total count of all the documents in buckets

When I search with the following aggregation:

"aggregations": {
"codes": {
  "terms": {
    "field": "code"
  },
  "aggs": {
    "dates": {
      "date_range": {
        "field": "created_time",
        "ranges": [
          {
            "from": "2017-12-06T00:00:00.000",
            "to": "2017-12-06T16:00:00.000"
          },
          {
            "from": "2017-12-07T00:00:00.000",
            "to": "2017-12-07T23:59:59.999"
          }
        ]
      }
    }
  }
}
}

I get the following result:

"aggregations": {
"codes": {
  "buckets": [
    {
      "key": "123456",
      "doc_count": 104005499,
      "dates": {
        "buckets": [
          {
            "key": "2017-12-05T20:00:00.000Z-2017-12-06T12:00:00.000Z",
            "from_as_string": "2017-12-05T20:00:00.000Z",
            "to_as_string": "2017-12-06T12:00:00.000Z",
            "doc_count": 156643
          },
          {
            "key": "2017-12-06T20:00:00.000Z-2017-12-07T19:59:59.999Z",
            "from_as_string": "2017-12-06T20:00:00.000Z",
            "to_as_string": "2017-12-07T19:59:59.999Z",
            "doc_count": 11874
          }
        ]
      }
    },
    ...
  ]
 }
}

So now I have a list of buckets of buckets. I need a total count for each outer bucket, which is the sum of the doc_count values of the buckets inside it. For example, the total count for my first bucket should be 156643 + 11874 = 168517. I've tried the Sum Bucket aggregation:

 "totalcount": {
      "sum_bucket": {
        "buckets_path": "dates"
      }
    }

but this does not work, because "buckets_path must reference either a number value or a single value numeric metric aggregation, got: org.elasticsearch.search.aggregations.bucket.range.date.InternalDateRange.Bucket". Any ideas how I should do this?

Upvotes: 3

Views: 2735

Answers (1)

Nikolay Vasiliev

Reputation: 6076

Looks like this is a known problem. There's a discussion on the Elastic forum where I found a hack to solve it (thanks to Ruslan_Didyk, the author of the hack, btw):

POST my_aggs/my_doc/_search
{
  "size": 0,
  "aggregations": {
    "codes": {
      "terms": {
        "field": "code"
      },
      "aggs": {
        "dates": {
          "date_range": {
            "field": "created_time",
            "ranges": [
              {
                "from": "2017-12-06T00:00:00.000",
                "to": "2017-12-06T16:00:00.000"
              },
              {
                "from": "2017-12-07T00:00:00.000",
                "to": "2017-12-07T23:59:59.999"
              }
            ]
          },
          "aggs": {
            "my_cnt": {
              "value_count": {
                "field": "created_time"
              }
            }
          }
        },
        "totalcount": {
          "stats_bucket": {
            "buckets_path": "dates>my_cnt"
          }
        }
      }
    }
  }
}

The reason you can't use totalcount on its own is that date_range implicitly creates sub-buckets, and pipeline aggregations cannot handle them (I would say it is a bug in Elasticsearch).

So the hack is to add another sub-aggregation to dates: my_cnt, which simply counts the number of documents in each bucket. (Note that I used the value_count aggregation on the created_time field, assuming it is present in all documents and has exactly one value.)

Given the set of documents like this:

{"code":"1234","created_time":"2017-12-06T01:00:00"}
{"code":"1234","created_time":"2017-12-06T17:00:00"}
{"code":"1234","created_time":"2017-12-07T01:00:00"}
{"code":"1234","created_time":"2017-12-06T02:00:00"}
{"code":"1235","created_time":"2017-12-07T18:00:00"}
{"code":"1234","created_time":"2017-12-07T18:00:00"}

The result of the aggregation will be:

  "aggregations": {
    "codes": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "1234",
          "doc_count": 5,
          "dates": {
            "buckets": [
              {
                "key": "2017-12-06T00:00:00.000Z-2017-12-06T16:00:00.000Z",
                "from": 1512518400000,
                "from_as_string": "2017-12-06T00:00:00.000Z",
                "to": 1512576000000,
                "to_as_string": "2017-12-06T16:00:00.000Z",
                "doc_count": 2,
                "my_cnt": {
                  "value": 2
                }
              },
              {
                "key": "2017-12-07T00:00:00.000Z-2017-12-07T23:59:59.999Z",
                "from": 1512604800000,
                "from_as_string": "2017-12-07T00:00:00.000Z",
                "to": 1512691199999,
                "to_as_string": "2017-12-07T23:59:59.999Z",
                "doc_count": 2,
                "my_cnt": {
                  "value": 2
                }
              }
            ]
          },
          "totalcount": {
            "count": 2,
            "min": 2,
            "max": 2,
            "avg": 2,
            "sum": 4
          }
        },
        {
          "key": "1235",
          "doc_count": 1,
          "dates": {
            "buckets": [
              {
                "key": "2017-12-06T00:00:00.000Z-2017-12-06T16:00:00.000Z",
                "from": 1512518400000,
                "from_as_string": "2017-12-06T00:00:00.000Z",
                "to": 1512576000000,
                "to_as_string": "2017-12-06T16:00:00.000Z",
                "doc_count": 0,
                "my_cnt": {
                  "value": 0
                }
              },
              {
                "key": "2017-12-07T00:00:00.000Z-2017-12-07T23:59:59.999Z",
                "from": 1512604800000,
                "from_as_string": "2017-12-07T00:00:00.000Z",
                "to": 1512691199999,
                "to_as_string": "2017-12-07T23:59:59.999Z",
                "doc_count": 1,
                "my_cnt": {
                  "value": 1
                }
              }
            ]
          },
          "totalcount": {
            "count": 1,
            "min": 1,
            "max": 1,
            "avg": 1,
            "sum": 1
          }
        }
      ]
    }
  }

The desired value is under totalcount.sum.
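
If you only need the sum and not the other statistics, a sum_bucket pipeline over the same path should also work, since dates>my_cnt now points at a single-value numeric metric rather than at the date_range buckets themselves. A minimal sketch (not from the original discussion):

"totalcount": {
  "sum_bucket": {
    "buckets_path": "dates>my_cnt"
  }
}

The total should then be returned directly as totalcount.value.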

Some considerations

As I already said, this is valid only as long as the assumption holds that created_time is always present and has exactly one value per document. If, in a different situation, the field under the date_range aggregation had several values (e.g. an update_time listing all updates of a document), then the sum would no longer equal the actual number of matched documents: a document whose dates fall into several ranges would be counted more than once.

In that case you can always go with a filter aggregation with a range query inside, as sketched below.
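
For illustration, here is a minimal sketch of that fallback (the bool/should combination of the two ranges is my assumption of how to express it; gte/lt mirrors the inclusive-from/exclusive-to semantics of date_range). The doc_count of the filter bucket gives the per-code total directly, counting each document at most once even if several of its dates match:

POST my_aggs/my_doc/_search
{
  "size": 0,
  "aggregations": {
    "codes": {
      "terms": {
        "field": "code"
      },
      "aggs": {
        "totalcount": {
          "filter": {
            "bool": {
              "should": [
                {
                  "range": {
                    "created_time": {
                      "gte": "2017-12-06T00:00:00.000",
                      "lt": "2017-12-06T16:00:00.000"
                    }
                  }
                },
                {
                  "range": {
                    "created_time": {
                      "gte": "2017-12-07T00:00:00.000",
                      "lt": "2017-12-07T23:59:59.999"
                    }
                  }
                }
              ]
            }
          }
        }
      }
    }
  }
}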

Hope that helps!

Upvotes: 1
