aethos

Reputation: 63

How can I get a total count across all buckets in a terms aggregation?

I'm using an Elasticsearch terms aggregation to bucket documents by an array property. I'd like to get the total number of distinct documents that fall into at least one bucket.

Let's say each document is a Post and has an array property media, which lists the social media websites the post is on (and may be empty):

{
   "id": 1,
   "media": ["facebook", "twitter", "instagram"]
}
{
   "id": 2,
   "media": ["twitter", "instagram", "tiktok"]
}
{
   "id": 3,
   "media": ["instagram"]
}
{
   "id": 4,
   "media": []
}

And here's my terms aggregation on media.

{
  "aggs": {
    "Posts_by_media": {
      "terms": {
        "field": "media",
        "size": 1000
      }
    }
  }
}

This will return the following:

{
  ...
  "aggregations": {
    "Posts_by_media": {
      "doc_count_error_upper_bound": 0,   
      "sum_other_doc_count": 0,           
      "buckets": [                        
        {
          "key": "instagram",
          "doc_count": 3
        },
        {
          "key": "twitter",
          "doc_count": 2
        },
        {
          "key": "facebook",
          "doc_count": 1
        },
        {
          "key": "tiktok",
          "doc_count": 1
        }
      ]
    }
  }
}

Along with this result, I want to know the total number of documents in these buckets.

As you can see, each document is counted once for every value in media. So the post with id: 1 counts toward three buckets: facebook, twitter and instagram.

So simply adding the bucket counts together will not work. I'd end up with 7, whereas the correct answer is 3 (posts 1, 2 and 3; the document with media: [] is not included in any bucket).
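
To make the double counting concrete, here is a small sketch in plain Python (not Elasticsearch itself; the variable names are only for illustration) that mimics the bucket counts for the four example posts:

```python
from collections import Counter

# The four example posts from the question.
posts = [
    {"id": 1, "media": ["facebook", "twitter", "instagram"]},
    {"id": 2, "media": ["twitter", "instagram", "tiktok"]},
    {"id": 3, "media": ["instagram"]},
    {"id": 4, "media": []},
]

# Per-bucket counts: a post contributes once to each distinct media value it has.
bucket_counts = Counter(m for post in posts for m in set(post["media"]))

# Summing the buckets counts a post once for every bucket it falls into.
sum_of_buckets = sum(bucket_counts.values())

# Distinct posts that land in at least one bucket (non-empty media).
posts_in_any_bucket = sum(1 for post in posts if post["media"])

print(sum_of_buckets)       # 7
print(posts_in_any_bucket)  # 3
```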

Is there a way to return the total number of documents in these buckets?

Looking at the docs a bit, it seems I could add a second aggregation, a filter aggregation wrapping an exists query, like so:

{
  "aggs": {
    "posts_with_media": {
      "filter": {
        "exists": {
          "field": "media"
        }
      }
    }
  }
}

But I don't feel great about this being a separate aggregation -- it feels like if I'm not careful, the two could get out of sync.

What's the recommendation here?

Upvotes: 0

Views: 938

Answers (1)

SuperPirate

Reputation: 146

From your description, you want to count the documents where media is not empty? If so, you can use the following query.

{
  "query": {
    "bool": {
      "must": [
        {
          "exists": {
            "field": "media"
          }
        }
      ]
    }
  },
  "size": 0
}

The total under hits in the response is the number of documents that match this query.

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : 0.0,
    "hits" : [ ]
  }
}
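
If the concern is keeping the count and the buckets in sync, one option is to send both in a single request, so they are evaluated against the same query. The following is only a sketch under assumptions: the aggregation name posts_with_media and the helper functions are hypothetical, and no Elasticsearch client is used; the code just builds the request body and reads the counts out of a response dict.

```python
import json

FIELD = "media"  # field name taken from the question


def build_request(field=FIELD):
    """Build one request body with the terms aggregation and a sibling
    filter aggregation wrapping an exists query (hypothetical name)."""
    return {
        "size": 0,  # only the aggregations are needed, not the hits
        "aggs": {
            "Posts_by_media": {"terms": {"field": field, "size": 1000}},
            "posts_with_media": {"filter": {"exists": {"field": field}}},
        },
    }


def total_in_buckets(response):
    """Number of documents that have at least one value in the field."""
    return response["aggregations"]["posts_with_media"]["doc_count"]


def hits_total(response):
    """Read hits.total, which is a plain number in older responses and an
    object with a "value" key in Elasticsearch 7 and later."""
    total = response["hits"]["total"]
    return total["value"] if isinstance(total, dict) else total


print(json.dumps(build_request(), indent=2))
```

Because both aggregations run in the same request, any query added to the body scopes both of them. Note that in Elasticsearch 7 and later, hits.total.value is by default only accurate up to 10,000 hits unless track_total_hits is set to true; the doc_count of the filter aggregation does not have that cap.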

Upvotes: 0
