ElasticSearch aggregating by a nested field with variable nesting (or over particular json field)

Question

I have the following structure GET /index-*/_mapping:

    "top_field" : {
      "properties" : {
        "dict_key1" : {
          "properties" : {
            "field1" : {...},
            "field2" : {...},
            "field3" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "field4" : {...}
          },
        "dict_key2" : {
          "properties" : {
            "field1" : {...},
            "field2" : {...},
            "field3" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "field4" : {...}
          },
        "dict_key3": ...
        }

In other words, top_field stores a json.

I would like to aggregate over 'field3.keyword' regardless of dict_key*. Something like top_field.*.field3.keyword.

However, I can't get it to work using terms aggregation, with or without nested. I also tried to just to bucket by the different dict_key*, which would be almost as good, but I can't get this to work either.

How can I do this?

Joe - Check out my books · Accepted Answer

TL;DR I had the same problem some time ago (Terms aggregation with nested wildcard path) and it turns out it's not directly possible due to the way lookups and path accessors are performed.

There's a scripted workaround though:

{
  "size": 0,
  "aggs": {
    "terms_emulator": {
      "scripted_metric": {
        "init_script": "state.keyword_counts = [:]",
        "map_script": """
          def source = params._source['top_field'];
          for (def key : source.keySet()) {
            if (!source[key].containsKey('field3')) continue;
            
            def field3_kw = source[key]['field3'];
        
            if (state.keyword_counts.containsKey(field3_kw)) { 
              state.keyword_counts[field3_kw] += 1;
            } else {
              state.keyword_counts[field3_kw] = 1;
            }
          }
        """,
        "combine_script": "state",
        "reduce_script": "states[0]"
      }
    }
  }
}

yielding something along the lines of

"aggregations" : {
  "terms_emulator" : {
    "value" : {
      "keyword_counts" : {
        "world" : 1,
        "hello" : 2
      }
    }
  }
}

While this works just fine, I'd disadvise using scripts in production. You could rather restructure your data such that straightforward lookups are possible. For instance:

{
  "top_field": {
    "entries": [
      {
        "group_name": "dict_key1",
        "key_value_pairs": {
          "field3": "hello"
        }
      },
      {
        "group_name": "dict_key2",
        "key_value_pairs": {
          "field3": "world"
        }
      }
    ]
  }
}

and make entries nested. Maybe even ditch top_field since it seems redundant and start directly w/ entries.

ElasticSearch aggregating by a nested field with variable nesting (or over particular json field)

Answers (1)

Related Questions