Reputation: 2337
I'm currently aggregating records to get the average (arithmetic average) of a field in the returned records. My use case requires me to get hold of the geometric average:
The geometric mean is defined as the nth root of the product of n numbers.
How could I go about getting this value? I don't even know where to start!
Thanks!
Upvotes: 1
Views: 448
Reputation: 6076
It is not trivial, but it can be done. The idea is to use a sum of logs and then apply the n-th root:
pow(exp((sum of logs)), 1/n)
In fact, the GeometricMean aggregation of the Elasticsearch Index Termlist Plugin does exactly that. (However, this is a third-party plugin, so I can't tell whether it is stable enough.)
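To see why this gives the geometric mean, note that the logarithm turns the product into a sum, so the nth root of the product can be rewritten entirely in terms of a sum of logs:

x1 * x2 * ... * xn = exp(log(x1) + log(x2) + ... + log(xn))
(x1 * x2 * ... * xn) ^ (1/n) = pow(exp(log(x1) + log(x2) + ... + log(xn)), 1/n)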
Let's assume we have the following mapping:
PUT geom_mean
{
  "mappings": {
    "nums": {
      "properties": {
        "x": {
          "type": "double"
        }
      }
    }
  }
}
And we insert the following documents:
{"x":33}
{"x":324}
{"x":134}
{"x":0.1}
Now we can run the query that calculates the geometric mean:
POST geom_mean/nums/_search
{
  "size": 0,
  "aggs": {
    "aggs_root": {
      "terms": {
        "script": "'Bazinga!'"
      },
      "aggs": {
        "sum_log_x": {
          "sum": {
            "script": {
              "inline": "Math.log(doc.x.getValue())"
            }
          }
        },
        "geom_mean": {
          "bucket_script": {
            "buckets_path": {
              "sum_log_x": "sum_log_x",
              "x_cnt": "_count"
            },
            "script": "Math.pow(Math.exp(params.sum_log_x), 1 / params.x_cnt)"
          }
        }
      }
    }
  }
}
The return value will be:
"aggregations": {
"aggs_root": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Bazinga!",
"doc_count": 4,
"sum_log_x": {
"value": 11.872505784215674
},
"geom_mean": {
"value": 19.455434622111177
}
}
]
}
}
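As a quick sanity check (a hand computation, not part of the Elasticsearch response), these numbers match the direct definition applied to the four indexed values:

log(33) + log(324) + log(134) + log(0.1) ≈ 3.4965 + 5.7807 + 4.8978 - 2.3026 ≈ 11.8725
(33 * 324 * 134 * 0.1) ^ (1/4) = 143272.8 ^ (1/4) ≈ 19.4554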
Now a bit of explanation. The sum_log_x aggregation computes the sum of the logarithms of x. The geom_mean aggregation is a bucket_script pipeline aggregation applied to the result of its sibling sum_log_x aggregation; it uses the special bucket path _count to get the number of documents in the bucket. (The Elasticsearch documentation on the bucket_script aggregation explains this in more detail.) The final trick is to wrap both of them in some parent aggregation because, as explained in this issue, bucket_script cannot be a top-level aggregation. Here I use a terms aggregation on a script that always returns 'Bazinga!', so all documents fall into a single bucket. Thanks to anhzhi, who proposed this hack.
Since the geometric mean is computed through logs, all x values should be greater than 0. However:
- if at least one value is negative, the result is "NaN"
- if no value is negative or "+Infinity", but at least one value is 0, the result is "-Infinity"
- if both "+Infinity" and "-Infinity" are among the values, the result is "NaN".
The queries were tested with Elasticsearch 5.4. Performance on a large collection of documents was not tested, so you might consider indexing x together with its log to make the aggregations more efficient (a rough sketch of that idea follows).
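To illustrate that last point, here is an untested sketch; the index name geom_mean_log and the field name x_log are made up for this example. The client computes the log at index time (documents would then look like {"x": 33, "x_log": 3.4965}), and the scripted sum is replaced by a plain sum over the x_log field:

PUT geom_mean_log
{
  "mappings": {
    "nums": {
      "properties": {
        "x": {
          "type": "double"
        },
        "x_log": {
          "type": "double"
        }
      }
    }
  }
}

POST geom_mean_log/nums/_search
{
  "size": 0,
  "aggs": {
    "aggs_root": {
      "terms": {
        "script": "'Bazinga!'"
      },
      "aggs": {
        "sum_log_x": {
          "sum": {
            "field": "x_log"
          }
        },
        "geom_mean": {
          "bucket_script": {
            "buckets_path": {
              "sum_log_x": "sum_log_x",
              "x_cnt": "_count"
            },
            "script": "Math.pow(Math.exp(params.sum_log_x), 1 / params.x_cnt)"
          }
        }
      }
    }
  }
}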
Hope that helps!
Upvotes: 1