Sergio6Rey
Sergio6Rey

Reputation: 401

How to aggregate the result of aggregation in Elasticsearch?

I want to aggregate the result of other aggregation using Elasticsearch. I have created the first aggregation I need:

es.search(index='stackoverflow', body = {
    "size":0,
    "query": {
        "bool": {
          "filter": {
              "match" : {"type": "Posts"}
          },
          "filter": {
              "match" : {"PostTypeId": "1"}
          }
        }
    },
    "aggs" : {
        "by_user": {
          "terms": {
            "field": "OwnerUserId"
          }
        }
    }
})

This query takes all the documents of type post that are questions (PostTypeId = 1). Then, it aggregates by OwnerUserId, which counts the number of question posts of each user, giving the following result:

{'took': 0,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 10000, 'relation': 'gte'},
  'max_score': None,
  'hits': []},
 'aggregations': {'by_user': {'doc_count_error_upper_bound': 0,
   'sum_other_doc_count': 31053,
   'buckets': [{'key': '2230', 'doc_count': 223},
    {'key': '', 'doc_count': 177},
    {'key': '38304', 'doc_count': 158},
    {'key': '5997', 'doc_count': 144},
    {'key': '4048', 'doc_count': 130},
    {'key': '25813', 'doc_count': 119},
    {'key': '27826', 'doc_count': 119},
    {'key': '2633', 'doc_count': 115},
    {'key': '19919', 'doc_count': 114},
    {'key': '13938', 'doc_count': 111}]}}}

Now I want to do another aggregation over the results of the previous one: aggregate by doc_count, I mean grouping and counting the equal number of question posts. For the previous result, my desired result would be:

{'buckets': [{'key': '223', 'doc_count': 1},
    {'key': '177', 'doc_count': 1},
    {'key': '158', 'doc_count': 1},
    {'key': '144', 'doc_count': 1},
    {'key': '130', 'doc_count': 1},
    {'key': '119', 'doc_count': 2},
    {'key': '115', 'doc_count': 1},
    {'key': '114', 'doc_count': 1},
    {'key': '111', 'doc_count': 1}]}

Upvotes: 2

Views: 3004

Answers (2)

Yegor Tokmakov
Yegor Tokmakov

Reputation: 1

The "Scripted Metric Aggregation" is a thing that will help to make such thing and get the statistics as a response from singe query.

The advantage of this approach is that you can include any logic and retrieve the necessary data.

https://www.elastic.co/guide/en/elasticsearch/reference/6.4/search-aggregations-metrics-scripted-metric-aggregation.html

Hope this will help others who is still looking for the resolution for the same kinds of tasks.

Upvotes: 0

Sergio6Rey
Sergio6Rey

Reputation: 401

I could find a way to aggregate the result of an aggregation (at least directly). As I have read in the forums of Elasticsearch, this use case is not contemplate because it would be so inefficient.

What I did in order to solve my use case was take advantage of the transform API to store the first aggregation in a temporal index and then perform the second aggregation over that index.

First I create a transform to perform the first aggregation (grouping by OwnerUserId and counting the number of questions posted by each user):

url = 'http://localhost:9200/_transform/transform_rq1'
headers = {
   'Content-Type': 'application/json'
}
query = {
  "source": {
    "index": "posts",
    "query": {
        "bool": {
          "filter": {
              "match" : {"PostTypeId": "1"}
          }
        }
    }
  },
  "dest": {
    "index": "rq1"
  },
  "pivot": {
    "group_by": {
      "OwnerUserId": {
        "terms": {
          "field": "OwnerUserId"
        }
      }
    },
    "aggregations": {
      "count": {
        "value_count": {
          "field": "OwnerUserId"
        }
      }
    }
  }
}

response = requests.put(url, headers=headers, data=json.dumps(query))

Then, I start the transform in order to execute it:

url = 'http://localhost:9200/_transform/transform_rq1/_start'
headers = {
   'Content-Type': 'application/json'
}

response = requests.post(url, headers=headers).json()

At last, I perform the second aggregation (grouping by the number of questions of each user to obtain how many users post how much questions) over the created temporal index:

response = es.search(index='rq1', body = {
    "size":0,
    "query": {
                "match_all": {}
             },
    "aggs" : {
        "by_num": {
          "terms": {
            "field": "count",
            "order" : { "_key" : "asc" },
            "size": 30000
          }
        }
    }
})

print(response)

As you can see, I have written this code in Python.

Upvotes: 1

Related Questions