Reputation: 63
I'm using an Elasticsearch terms aggregation to bucket documents by an array property. I'd like to get the total number of distinct documents that fall into any of the buckets.
Let's say each document is a Post and has an array property media, which specifies which social media websites the post is on (and may be empty):
{
"id": 1,
"media": ["facebook", "twitter", "instagram"]
}
{
"id": 2,
"media": ["twitter", "instagram", "tiktok"]
}
{
"id": 3,
"media": ["instagram"]
}
{
"id": 4,
"media": []
}
And here's my terms aggregation on media:
{
"aggs": {
"Posts_by_media": {
"terms": {
"field": "media",
"size": 1000
}
}
}
}
This will return the following:
{
...
"aggregations": {
"Posts_by_media": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "instagram",
"doc_count": 3
},
{
"key": "twitter",
"doc_count": 2
},
{
"key": "facebook",
"doc_count": 1
},
{
"key": "tiktok",
"doc_count": 1
}
]
}
}
}
Along with this result, I want to know the total number of documents in these buckets.
As you can see, a document is counted once for each value in media. So the post with id: 1 counts toward three buckets: facebook, twitter and instagram.
So it will not suffice to add the bucket counts together. I'd end up with 7 (3 + 2 + 1 + 1), where the correct answer is 3 (the document with media: [] is not included in any bucket).
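To make the overcounting concrete, here is a minimal Python sketch that mimics the bucketing on the four sample posts in memory. It does not talk to Elasticsearch; the names posts, bucket_counts, sum_of_buckets and docs_in_any_bucket are only illustrative.

```python
from collections import Counter

# Sample posts from the question; post 4 has an empty media array.
posts = [
    {"id": 1, "media": ["facebook", "twitter", "instagram"]},
    {"id": 2, "media": ["twitter", "instagram", "tiktok"]},
    {"id": 3, "media": ["instagram"]},
    {"id": 4, "media": []},
]

# Mimic the terms aggregation: a post counts once per distinct value it holds.
bucket_counts = Counter(m for post in posts for m in set(post["media"]))

# Adding the bucket counts double-counts posts that appear in several buckets.
sum_of_buckets = sum(bucket_counts.values())

# The number actually wanted: posts that land in at least one bucket.
docs_in_any_bucket = sum(1 for post in posts if post["media"])

print(sum_of_buckets, docs_in_any_bucket)  # 7 3
```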
Is there a way to return the total number of documents in these buckets?
Looking at the docs a bit, it seems I could use another aggregation, a filter aggregation wrapping an exists query, like so:
{
"aggs": {
"Posts_with_media": {
"filter": {
"exists": {
"field": "media"
}
}
}
}
}
But I don't feel great about this being a separate aggregation; it feels like if I'm not careful, the two could get out of sync.
What's the recommendation here?
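One way to address the sync concern, sketched here under assumptions, is to nest the terms aggregation inside a filter aggregation so both counts come from the same request and the same set of documents. The helper names build_search_body and summarize, and the aggregation name Posts_with_media, are hypothetical; the sample response is hand-written to match the four posts above, not captured from a cluster.

```python
def build_search_body(field="media", size=1000):
    """Build one request body: an exists filter with the terms agg nested inside."""
    return {
        "size": 0,  # no hits needed, only aggregation results
        "aggs": {
            "Posts_with_media": {
                # doc_count of this aggregation = documents that have the field
                "filter": {"exists": {"field": field}},
                "aggs": {
                    "Posts_by_media": {"terms": {"field": field, "size": size}}
                },
            }
        },
    }


def summarize(response):
    """Return (documents with the field, per-value bucket counts)."""
    parent = response["aggregations"]["Posts_with_media"]
    buckets = parent["Posts_by_media"]["buckets"]
    return parent["doc_count"], {b["key"]: b["doc_count"] for b in buckets}


# Hand-written response in the shape a filter aggregation with a sub-aggregation returns.
sample_response = {
    "aggregations": {
        "Posts_with_media": {
            "doc_count": 3,
            "Posts_by_media": {
                "doc_count_error_upper_bound": 0,
                "sum_other_doc_count": 0,
                "buckets": [
                    {"key": "instagram", "doc_count": 3},
                    {"key": "twitter", "doc_count": 2},
                    {"key": "facebook", "doc_count": 1},
                    {"key": "tiktok", "doc_count": 1},
                ],
            },
        }
    }
}

total, per_medium = summarize(sample_response)
print(total, per_medium)
```

Because an empty array is treated as a missing field by the exists query, the parent doc_count excludes the post with media: []. Note that it counts every document that has the field, so if the terms size is smaller than the number of distinct values, it can exceed the number of documents in the returned buckets.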
Upvotes: 0
Views: 938
Reputation: 146
From your description, you want the number of documents where media is not empty? If so, you can use the following query.
{
"query": {
"bool": {
"must": [
{
"exists": {
"field": "media"
}
}
]
}
},
"size": 0
}
The total under hits in the response is the number of documents matching this query:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : 0.0,
"hits" : [ ]
}
}
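The shape of hits.total depends on the Elasticsearch version: it is a plain number in responses like the one above, while Elasticsearch 7 and later return an object with value and relation fields by default. A small sketch of a reader that accepts both shapes follows; the helper name total_hits is hypothetical.

```python
def total_hits(response):
    """Return (count, relation) from a search response, for either hits.total shape."""
    total = response["hits"]["total"]
    if isinstance(total, dict):
        # Elasticsearch 7+: {"value": N, "relation": "eq" or "gte"}
        return total["value"], total.get("relation", "eq")
    # Older responses: a plain integer, always exact
    return total, "eq"


# The response shape shown above (plain integer total).
legacy = {"hits": {"total": 0, "max_score": 0.0, "hits": []}}
# The object shape used by Elasticsearch 7 and later.
modern = {"hits": {"total": {"value": 3, "relation": "eq"}, "hits": []}}

print(total_hits(legacy), total_hits(modern))
```

A relation of "gte" means the count is a lower bound; setting track_total_hits to true in the request asks for an exact count.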
Upvotes: 0