dom

Reputation: 444

How to perform sub-aggregation in elasticsearch?

I have a set of article documents in elasticsearch with fields content and publish_datetime.

I am trying to retrieve most frequent words from articles with publish year == 2021.

GET articles/_search
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "word_counts": {
      "terms": {
        "field": "content"
      }
    },
    "publish_datetime": {
      "terms": {
        "field": "publish_datetime"
      }
    },
    "aggs": {
      "word_counts_2021": {
        "bucket_selector": {
          "buckets_path": {
            "word_counts": "word_counts",
            "pd": "publish_datetime"
          },
          "script": "LocalDateTime.parse(params.pd).getYear() == 2021"
        }
      }
    }
  }
}

This fails with:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "parsing_exception",
        "reason" : "Unknown aggregation type [word_counts_2021]",
        "line" : 17,
        "col" : 25
      }
    ],
    "type" : "parsing_exception",
    "reason" : "Unknown aggregation type [word_counts_2021]",
    "line" : 17,
    "col" : 25,
    "caused_by" : {
      "type" : "named_object_not_found_exception",
      "reason" : "[17:25] unknown field [word_counts_2021]"
    }
  },
  "status" : 400
}

which does not make sense, because word_counts_2021 is the name of the aggregation according to the docs, not an aggregation type. I am the one who picks the name, so I thought it could be basically any value.

Does anyone have any idea what's going on here? So far this seems like a pretty unintuitive service to me.

Upvotes: 1

Views: 889

Answers (1)

Benjamin Trent

Reputation: 7566

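The agg as you have it written seems to be filtering the publish_datetime buckets so that only those in the year 2021 are included. In your request the inner "aggs" block sits next to the two terms aggregations, so Elasticsearch reads "aggs" as the name of an aggregation and word_counts_2021 as its type, hence the error. To filter those buckets, you must nest the sub-agg under that particular terms aggregation.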

Like so:

GET articles/_search
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "word_counts": {
      "terms": {
        "field": "content"
      }
    },
    "publish_datetime": {
      "terms": {
        "field": "publish_datetime"
      },
      "aggs": {
        "word_counts_2021": {
          "bucket_selector": {
            "buckets_path": {
              "pd": "publish_datetime"
            },
            "script": "LocalDateTime.parse(params.pd).getYear() == 2021"
          }
        }
      }
    }
  }
}

However, if that field is mapped as a date type, I would suggest simply filtering the documents with a range query and then aggregating the result.
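A minimal sketch of the range-query approach, meant as the body of GET articles/_search. It assumes publish_datetime is mapped as a date field that accepts the default ISO date format, and that content is aggregatable (a keyword field, or a text field with fielddata enabled); neither mapping is shown in the question, so adjust as needed. The "size": 0 setting is an optional addition that skips returning the hits themselves.

```json
{
  "size": 0,
  "query": {
    "range": {
      "publish_datetime": {
        "gte": "2021-01-01",
        "lt": "2022-01-01"
      }
    }
  },
  "aggs": {
    "word_counts": {
      "terms": {
        "field": "content"
      }
    }
  }
}
```

Because the query restricts the document set before any aggregation runs, word_counts here only counts terms from articles published in 2021, and no bucket_selector is needed.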

Upvotes: 2
