Elastic search aggregation on map - on each key

Question

I have following kind of documents.

document 1

{
  "doc": {
    "id": 1,
    "errors": {
      "e1":5,
      "e2":20,
      "e3":30
    },
    "warnings": {
      "w1":1,
      "w2":2
    }
  }
}

document 2

{
  "doc": {
    "id": 2,
    "errors": {
      "e1":10
    },
    "warnings": {
      "w1":1,
      "w2":2,
      "w3":33,
    }
  }
}

I would like to get following sum stats in one or more calls. Is it possible? I tried various solution but all works when key is known. In my case map keys (e1, e2 etc) are not known.

{
  "errors": {
    "e1": 15,
    "e2": 20,
    "e3": 30
  },
  "warnings": {
    "w1": 2,
    "w2": 4,
    "w3": 33
  }
}

Nikolay Vasiliev · Accepted Answer

There are two solutions, none of them are pretty. I have to point out that the option 2 should be the preferred way to go since option 1 uses an experimental feature.

1. Dynamic mapping, [experimental] scripted aggregation

Inspired by this answer and the Scripted Metric Aggregation page of ES docs, I began with just inserting your documents to non-existing index (which by default creates dynamic mapping).

NB: I tested this on ES 5.4, but the documentation suggests that this feature is available from at least 2.0.

The resulting query for aggregation is the following:

POST /my_index/my_type/_search
{
    "size": 0,
    "query" : {
        "match_all" : {}
    },
    "aggs": {
        "errors": {
            "scripted_metric": {
                "init_script" : "params._agg.errors = [:]",
                "map_script" : "for (t in params['_source']['doc']['errors'].entrySet()) { params._agg.errors[t.key] = t.value } ",
                "combine_script" : "return params._agg.errors",
                "reduce_script": "Map res = [:] ; for (a in params._aggs) { for (t in a.entrySet()) { res[t.key] = res.containsKey(t.key) ? res[t.key] + t.value : t.value } }  return res"

            }
        },
        "warnings": {
            "scripted_metric": {
                "init_script" : "params._agg.errors = [:]",
                "map_script" : "for (t in params['_source']['doc']['warnings'].entrySet()) { params._agg.errors[t.key] = t.value } ",
                "combine_script" : "return params._agg.errors",
                "reduce_script": "Map res = [:] ; for (a in params._aggs) { for (t in a.entrySet()) { res[t.key] = res.containsKey(t.key) ? res[t.key] + t.value : t.value } }  return res"

            }
        }
    }
}

Which produces this output:

{
   ...
   "aggregations": {
      "warnings": {
         "value": {
            "w1": 2,
            "w2": 4,
            "w3": 33
         }
      },
      "errors": {
         "value": {
            "e1": 15,
            "e2": 20,
            "e3": 30
         }
      }
   }
}

If you are following this path you might be interested in the JavaDoc of what params['_source'] is underneath.

Warning: I believe that scripted aggregation is not efficient and for better performance you should check out the option 2 or a different data processing engine.

What does experimental mean:

This functionality is experimental and may be changed or removed completely in a future release. Elastic will take a best effort approach to fix any issues, but experimental features are not subject to the support SLA of official GA features.

With this in mind we proceed to option 2.

2. Static nested mapping, nested aggregation

Here the idea is to store your data differently and essentially be able to query and aggregate it differently. Firstly, we need to create a mapping using nested data type.

PUT /my_index_nested/
{
    "mappings": {
        "my_type": {
            "properties": {
                "errors": { 
                    "type": "nested",
                    "properties": {
                        "name": {"type": "keyword"},
                        "val": {"type": "integer"}
                    }
                },
                "warnings": { 
                    "type": "nested",
                    "properties": {
                        "name": {"type": "keyword"},
                        "val": {"type": "integer"}
                    }
                }
            }
        }
    }
}

A document in such an index will look like this:

     {
        "_index": "my_index_nested",
        "_type": "my_type",
        "_id": "1",
        "_score": 1,
        "_source": {
           "errors": [
              {
                 "name": "e1",
                 "val": 5
              },
              {
                 "name": "e2",
                 "val": 20
              },
              {
                 "name": "e3",
                 "val": 30
              }
           ],
           "warnings": [
              {
                 "name": "w1",
                 "val": 1
              },
              {
                 "name": "w2",
                 "val": 2
              }
           ]
        }
     }

Next we need to write the aggregate query. First we need to use nested aggregation, which will allow us to query this special nested data type. But since we actually want to aggregate by name, and sum the values of val, we will need to do a sub-aggregation.

The resulting query is as follows (I am adding comments alongside the query for clarity):

POST /my_index_nested/my_type/_search
{
    "size": 0,
    "aggs": {
        "errors_top": {
            "nested": {
                // declare which nested objects we want to work with
                "path": "errors"  
            },
            "aggs": {
                "errors": {
                    // what we are aggregating - different values of name
                    "terms": {"field": "errors.name"}, 
                    // sub aggregation
                    "aggs": { 
                        "error_sum": {
                            // sum all val for same name
                            "sum": {"field": "errors.val"} 
                        }
                        
                    }
                }
            }
        },
        "warnings_top": {
            // analogous to errors
        }
    }
}

The output of this query will be like:

{
   ...
   "aggregations": {
      "errors_top": {
         "doc_count": 4,
         "errors": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
               {
                  "key": "e1",
                  "doc_count": 2,
                  "error_sum": {
                     "value": 15
                  }
               },
               {
                  "key": "e2",
                  "doc_count": 1,
                  "error_sum": {
                     "value": 20
                  }
               },
               {
                  "key": "e3",
                  "doc_count": 1,
                  "error_sum": {
                     "value": 30
                  }
               }
            ]
         }
      },
      "warnings_top": {
         ...
      }
   }
}

Elastic search aggregation on map - on each key

Answers (1)

1. Dynamic mapping, [experimental] scripted aggregation

2. Static nested mapping, nested aggregation

Related Questions