AOK
AOK

Reputation: 503

ElasticSearch aggregating by a nested field with variable nesting (or over particular json field)

I have the following structure GET /index-*/_mapping:

    "top_field" : {
      "properties" : {
        "dict_key1" : {
          "properties" : {
            "field1" : {...},
            "field2" : {...},
            "field3" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "field4" : {...}
          },
        "dict_key2" : {
          "properties" : {
            "field1" : {...},
            "field2" : {...},
            "field3" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "field4" : {...}
          },
        "dict_key3": ...
        }

In other words, top_field stores a json.

I would like to aggregate over 'field3.keyword' regardless of dict_key*. Something like top_field.*.field3.keyword.

However, I can't get it to work using terms aggregation, with or without nested. I also tried to just to bucket by the different dict_key*, which would be almost as good, but I can't get this to work either.

How can I do this?

Upvotes: 0

Views: 433

Answers (1)

Joe - Check out my books
Joe - Check out my books

Reputation: 16923

TL;DR I had the same problem some time ago (Terms aggregation with nested wildcard path) and it turns out it's not directly possible due to the way lookups and path accessors are performed.


There's a scripted workaround though:

{
  "size": 0,
  "aggs": {
    "terms_emulator": {
      "scripted_metric": {
        "init_script": "state.keyword_counts = [:]",
        "map_script": """
          def source = params._source['top_field'];
          for (def key : source.keySet()) {
            if (!source[key].containsKey('field3')) continue;
            
            def field3_kw = source[key]['field3'];
        
            if (state.keyword_counts.containsKey(field3_kw)) { 
              state.keyword_counts[field3_kw] += 1;
            } else {
              state.keyword_counts[field3_kw] = 1;
            }
          }
        """,
        "combine_script": "state",
        "reduce_script": "states[0]"
      }
    }
  }
}

yielding something along the lines of

"aggregations" : {
  "terms_emulator" : {
    "value" : {
      "keyword_counts" : {
        "world" : 1,
        "hello" : 2
      }
    }
  }
}

While this works just fine, I'd disadvise using scripts in production. You could rather restructure your data such that straightforward lookups are possible. For instance:

{
  "top_field": {
    "entries": [
      {
        "group_name": "dict_key1",
        "key_value_pairs": {
          "field3": "hello"
        }
      },
      {
        "group_name": "dict_key2",
        "key_value_pairs": {
          "field3": "world"
        }
      }
    ]
  }
}

and make entries nested. Maybe even ditch top_field since it seems redundant and start directly w/ entries.

Upvotes: 1

Related Questions