Deepak Hasani

Reputation: 33

Group by multiple columns in elastic search with specific result set

I am new to ES and I have a specific requirement. My documents look like this:

{
    "_index" : "bidder_server_stats",
    "_type" : "doc",
    "_id" : "_NTrHGQBv0YTjfMi0Ord",
    "_score" : 1.0,
    "_source" : {
      "avg_price" : 5.8,
      "bc" : "2513",
      "log_dt_st" : "2018-06-08T06:36:16.073Z",
      "nid" : "1",
      "cc" : "880",
      "host" : "ip-172-31-18-62.ec2.internal",
      "country" : "us"
    }
  },
  {
    "_index" : "bidder_server_stats",
    "_type" : "doc",
    "_id" : "_NTrHGQBv0YTjfMi0Ord",
    "_score" : 1.0,
    "_source" : {
      "avg_price" : 10,
      "bc" : "2514",
      "log_dt_st" : "2018-06-08T06:36:16.073Z",
      "nid" : "1",
      "cc" : "880",
      "host" : "ip-172-31-18-62.ec2.internal",
      "country" : "us"
    }
  },
  {
    "_index" : "bidder_server_stats",
    "_type" : "doc",
    "_id" : "_NTrHGQBv0YTjfMi0Ord",
    "_score" : 1.0,
    "_source" : {
      "avg_price" : 11,
      "bc" : "2513",
      "log_dt_st" : "2018-06-08T06:36:16.073Z",
      "nid" : "1",
      "cc" : "880",
      "host" : "ip-172-31-18-62.ec2.internal",
      "country" : "us"
    }
  }

Now I need the same result that I would get with the SQL query below:

SELECT bc, log_dt_st, SUM(avg_price) FROM table GROUP BY bc, log_dt_st;

How can I do this in Elasticsearch? I also want only these three columns in the result set (i.e. in _source).

Please help

Upvotes: 2

Views: 7047

Answers (1)

Nikolay Vasiliev

Reputation: 6066

You can achieve this with sub-aggregations. Starting from ES 6.1, the composite aggregation could also come in handy (although it is still experimental).

The query might look like this:

POST bidder_server_stats/doc/_search
{
  "size": 0,
  "aggs": {
    "by bc": {
      "terms": {
        "field": "bc"
      },
      "aggs": {
        "by log_dt_st": {
          "terms": {
            "field": "log_dt_st"
          },
          "aggs": {
            "sum(avg_price)": {
              "sum": {
                "field": "avg_price"
              }
            }
          }
        }
      }
    }
  }
}

And the response would look like this:

{
  ...
  "aggregations": {
    "by bc": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "2513",
          "doc_count": 2,
          "by log_dt_st": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": 1528439776073,
                "key_as_string": "2018-06-08T06:36:16.073Z",
                "doc_count": 2,
                "sum(avg_price)": {
                  "value": 16.800000190734863
                }
              }
            ]
          }
        },
        {
          "key": "2514",
          "doc_count": 1,
          "by log_dt_st": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": 1528439776073,
                "key_as_string": "2018-06-08T06:36:16.073Z",
                "doc_count": 1,
                "sum(avg_price)": {
                  "value": 10
                }
              }
            ]
          }
        }
      ]
    }
  }
}
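To get the SQL-like rows (bc, log_dt_st, sum) out of this nested response, you can flatten the buckets on the client side. A minimal Python sketch, assuming the aggregation names "by bc", "by log_dt_st" and "sum(avg_price)" used in the query above (the `flatten_buckets` helper name is my own):

```python
def flatten_buckets(aggregations):
    """Flatten nested terms buckets into SQL-like rows."""
    rows = []
    for bc_bucket in aggregations["by bc"]["buckets"]:
        for dt_bucket in bc_bucket["by log_dt_st"]["buckets"]:
            rows.append({
                "bc": bc_bucket["key"],
                "log_dt_st": dt_bucket["key_as_string"],
                "sum(avg_price)": dt_bucket["sum(avg_price)"]["value"],
            })
    return rows

# Example with a trimmed-down version of the response above:
sample = {
    "by bc": {
        "buckets": [
            {
                "key": "2513",
                "by log_dt_st": {
                    "buckets": [
                        {
                            "key_as_string": "2018-06-08T06:36:16.073Z",
                            "sum(avg_price)": {"value": 16.8},
                        }
                    ]
                },
            }
        ]
    }
}
print(flatten_buckets(sample))
# → [{'bc': '2513', 'log_dt_st': '2018-06-08T06:36:16.073Z', 'sum(avg_price)': 16.8}]
```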

A few points to consider:

  • bc should have the keyword type (so that a terms aggregation can be run on it)
  • the terms aggregation only returns the top 10 buckets by default; you may be interested in the size and order options of this aggregation
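For illustration, a mapping with bc as keyword might look like this (a sketch only; the field types for log_dt_st and avg_price are assumptions, and an existing index would need to be reindexed to change its mapping):

```
PUT bidder_server_stats
{
  "mappings": {
    "doc": {
      "properties": {
        "bc":        { "type": "keyword" },
        "log_dt_st": { "type": "date" },
        "avg_price": { "type": "double" }
      }
    }
  }
}
```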

Update: Responding to the questions from the comments since it will improve the answer.

Can we have some more fields in result set without adding them into aggregations?

No, not directly. As in a SQL GROUP BY, every field returned must either be part of the grouping or be the result of an aggregate function.

There are a few options to get more data alongside the aggregations:

  • the search results themselves (the hits part);
  • top_hits aggregation, which lets you retrieve a few of the most relevant documents for a given bucket.
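As an illustration, a top_hits sub-aggregation could be placed under the innermost bucket to carry a sample document per group (a sketch; the aggregation name, size and _source filter are just examples):

```
"aggs": {
  "sample_doc": {
    "top_hits": {
      "size": 1,
      "_source": [ "host", "country" ]
    }
  }
}
```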

How many sub-aggregations can we add?

I cannot find any documentation or configuration setting that would give a definite answer. However, there is an index.max_docvalue_fields_search setting that defaults to 100 in the dynamic index settings. Since aggregations use doc_values, I'd say that around 100 bucket aggregations is a reasonable upper limit.

I believe the real limitation here is the performance of your Elasticsearch cluster.

Can we have all resulted fields in same bucket?

It can be done, but it might not be efficient. You can use the script mode of the terms aggregation. The query might look like this:

POST bidder_server_stats/doc/_search
{
  "size": 0,
  "aggs": {
    "via script": {
      "terms": {
        "script": {
          "source": "doc['bc'].value +':::'+ doc['log_dt_st'].value ",
          "lang": "painless"
        }
      },
      "aggs": {
        "sum(avg_price)": {
          "sum": {
            "field": "avg_price"
          }
        }
      }
    }
  }
}

And the result will look like the following:

{
  ...
  "aggregations": {
    "via script": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "2513:::2018-06-08T06:36:16.073Z",
          "doc_count": 2,
          "sum(avg_price)": {
            "value": 16.800000190734863
          }
        },
        {
          "key": "2514:::2018-06-08T06:36:16.073Z",
          "doc_count": 1,
          "sum(avg_price)": {
            "value": 10
          }
        }
      ]
    }
  }
}
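Since the grouped values are now fused into a single string key, the client has to split them back into columns. A minimal Python sketch, assuming the ":::" separator and the aggregation names from the script query above (the `split_rows` helper name is my own):

```python
def split_rows(aggregations):
    """Split the composed script keys back into separate columns."""
    rows = []
    for bucket in aggregations["via script"]["buckets"]:
        # The key was built in the script as bc + ':::' + log_dt_st
        bc, log_dt_st = bucket["key"].split(":::", 1)
        rows.append({
            "bc": bc,
            "log_dt_st": log_dt_st,
            "sum(avg_price)": bucket["sum(avg_price)"]["value"],
        })
    return rows

# Example with the buckets from the response above:
sample = {
    "via script": {
        "buckets": [
            {"key": "2513:::2018-06-08T06:36:16.073Z",
             "sum(avg_price)": {"value": 16.8}},
            {"key": "2514:::2018-06-08T06:36:16.073Z",
             "sum(avg_price)": {"value": 10}},
        ]
    }
}
rows = split_rows(sample)
print(rows[0]["bc"], rows[0]["log_dt_st"])
# → 2513 2018-06-08T06:36:16.073Z
```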

In order to perform this aggregation, Elasticsearch has to compute the bucket value for each document matching the query, which is the equivalent of a full scan in SQL. Field-based aggregations, by contrast, are more like index look-ups, since they use the doc_values data representation, a columnar data structure that makes such look-ups efficient.

In some cases script-based buckets can be a solution, but their scope is quite limited. If you are interested in a script-based approach, there is also the scripted metric aggregation to consider.

Hope that helps!

Update: Since ES 6.1 this can be done with the composite aggregation

The composite aggregation was added in Elasticsearch 6.1. As of 6.3 it is still marked as experimental (so the API might change, or the feature could be removed entirely in the future).

The query in this case would look like:

POST bidder_server_stats/doc/_search
{
  "size": 0,
  "aggs": {
    "my composite": {
      "composite": {
        "sources": [
          {
            "bc": {
              "terms": {
                "field": "bc"
              }
            }
          },
          {
            "log_dt_st": {
              "terms": {
                "field": "log_dt_st"
              }
            }
          }
        ]
      },
      "aggs": {
        "sum(avg_price)": {
          "sum": {
            "field": "avg_price"
          }
        }
      }
    }
  }
}

And the response:

{
  "aggregations": {
    "my composite": {
      "after_key": {
        "bc": "2514",
        "log_dt_st": 1528439776073
      },
      "buckets": [
        {
          "key": {
            "bc": "2513",
            "log_dt_st": 1528439776073
          },
          "doc_count": 2,
          "sum(avg_price)": {
            "value": 16.800000190734863
          }
        },
        {
          "key": {
            "bc": "2514",
            "log_dt_st": 1528439776073
          },
          "doc_count": 1,
          "sum(avg_price)": {
            "value": 10
          }
        }
      ]
    }
  }
}
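One advantage of the composite aggregation is that it can be paged through with the "after_key" value. A minimal Python sketch of building the next page's request body, assuming the aggregation name "my composite" from the query above (the `build_next_request` helper name is my own; in a real client you would stop once a page comes back with no after_key or fewer buckets than the requested size):

```python
import copy

def build_next_request(base_body, response):
    """Return the request body for the next composite page, or None if done.

    Copies the original body and injects the previous page's after_key
    into the composite aggregation's "after" parameter.
    """
    comp = response["aggregations"]["my composite"]
    after = comp.get("after_key")
    if after is None:
        return None
    body = copy.deepcopy(base_body)
    body["aggs"]["my composite"]["composite"]["after"] = after
    return body
```

Each page would be sent with the returned body until `build_next_request` yields None.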

Upvotes: 2
