user1051849
user1051849

Reputation: 2337

ElasticSearch - How to Aggregate the Geometric Mean?

I'm currently aggregating records to get the average (arithmetic average) of a field in the returned records. My use case requires me to get hold of the geometric average:

The geometric mean is defined as the nth root of the product of n

How could I go about getting this value? I don't even know where to start!

Thanks!

Upvotes: 1

Views: 448

Answers (1)

Nikolay Vasiliev
Nikolay Vasiliev

Reputation: 6076

It is not trivial, but it can be done. The idea is to use a sum of logs and then apply the n-th root:

pow(exp((sum of logs)), 1/n)

In fact, GeometricMean aggregation of Elasticsearch Index Termlist Plugin does exactly that. (However, this is a third-party plugin, I can't tell if it is stable enough.)

Mapping and sample data

Let's assume we have the following mapping:

PUT geom_mean
{
  "mappings": {
    "nums": {
      "properties": {
        "x": {
          "type": "double"
        }
      }
    }
  }
}

And we insert the following documents:

{"x":33}
{"x":324}
{"x":134}
{"x":0.1}

Now we can try the query.

The ES query

Here is the query to calculate geometric mean:

POST geom_mean/nums/_search
{
  "size": 0,
  "aggs": {
    "aggs_root": {
      "terms": {
        "script": "'Bazinga!'"
      },
      "aggs": {
        "sum_log_x": {
          "sum": {
            "script": {
              "inline": "Math.log(doc.x.getValue())"
            }
          }
        },
        "geom_mean": {
          "bucket_script": {
            "buckets_path": {
              "sum_log_x": "sum_log_x",
              "x_cnt": "_count"
            },
            "script": "Math.pow(Math.exp(params.sum_log_x), 1 / params.x_cnt)"
          }
        }
      }
    }
  }
}

The return value will be:

  "aggregations": {
    "aggs_root": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "Bazinga!",
          "doc_count": 4,
          "sum_log_x": {
            "value": 11.872505784215674
          },
          "geom_mean": {
            "value": 19.455434622111177
          }
        }
      ]
    }
  }

Now a bit of explanation. Aggregation sum_log_x computes the sum of x. Aggregation named geom_mean is a sibling pipeline aggregation which is applied on the result of sum_log_x aggregation (its sibling). It uses special bucket path _count to get the number of elements. (Here you can read about bucket_script aggregation a bit more.)

The final trick is to wrap both of them with some aggregation, because, as explained in this issue, bucket_script cannot be a top-level aggregation. Here I do a terms aggregation on a script that always returns 'Bazinga!'

Thanks to anhzhi who proposed this hack.

Important considerations

Since the geometric mean is computed through logs, all x values should be greater than 0. However:

  • if any of values are < 0, the result is "NaN"
  • if all values are non-negative and less than "+Infinity", but at least one value is 0, the result is "-Infinity"
  • if both "+Infinity" and "-Infinity" are among the values, the result is "NaN".

The queries were tested with Elasticsearch 5.4. Performance on a large collection of documents was not tested, you might consider inserting x together with its log to make aggregations more efficient.

Hope that helps!

Upvotes: 1

Related Questions