antoinestv

Reputation: 3306

Elasticsearch multiple sum aggregations

We have a lot of documents in each index (~10,000,000), but each document is very small and contains almost exclusively integer values.

We need to SUM all the numeric fields.

  1. First step - We retrieve all available fields with a mapping request.

Example:

GET INDEX/TYPE/_mapping
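
The response (the field names here are just placeholders) lists each field with its type, for example:

{
    "INDEX": {
        "mappings": {
            "TYPE": {
                "properties": {
                    "FIELD 1": { "type": "integer" },
                    "FIELD 2": { "type": "integer" },
                    // ...
                    "FIELD N": { "type": "integer" }
                }
            }
        }
    }
}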
  2. Second step - We build the request with the fields returned by the mapping.

Example:

GET INDEX/TYPE/_search
{
    // SOME FILTERS TO REDUCE THE NUMBER OF DOCUMENTS
    "size":0,
    "aggs":{  
        "FIELD 1":{  
            "sum":{  
                "field":"FIELD 1"
            }
        },
        "FIELD 2":{  
            "sum":{  
                "field":"FIELD 2"
            }
        },
        // ...
        "FIELD N":{  
            "sum":{  
                "field":"FIELD N"
            }
        }
    }
}
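
Each named aggregation then comes back as a single value in the response; a sketch with made-up numbers:

{
    // ...
    "aggregations": {
        "FIELD 1": { "value": 42 },
        "FIELD 2": { "value": 1337 },
        // ...
        "FIELD N": { "value": 7 }
    }
}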

Our problem is that the execution time of the second request grows linearly with the number of fields N.

That's not acceptable for simple sums, so we tried to build our own aggregation with a scripted metric (Groovy).

Example with only 2 fields:

// ...
"aggs": {
    "test": {
        "scripted_metric": {
            "init_script": "_agg['t'] = []",
            "map_script": "_agg.t.add(['FIELD 1': doc['FIELD 1'].value, 'FIELD 2': doc['FIELD 2'].value])",
            "combine_script": "res = [:]; res['FIELD 1'] = 0; res['FIELD 2'] = 0; for (t in _agg.t) { res['FIELD 1'] += t['FIELD 1']; res['FIELD 2'] += t['FIELD 2'] }; return res",
            "reduce_script": "res = [:]; res['FIELD 1'] = 0; res['FIELD 2'] = 0; for (t in _aggs) { res['FIELD 1'] += t['FIELD 1']; res['FIELD 2'] += t['FIELD 2'] }; return res"
        }
    }
}
// ...

But it appears that the more assignments we add to the script, the longer it takes to execute, so it doesn't solve our problem.

There are not many examples of this out there.

Do you have any ideas to improve this script's performance? Or any other approaches?

Upvotes: 4

Views: 2027

Answers (1)

NikoNyrh

Reputation: 4138

How could it calculate N sums in sub-linear time? Does any such system exist?

10 million documents isn't actually that many. How long are your queries taking, how many shards do you have, and is the CPU maxed at 100%? (I was going to ask these in a comment but don't have 50 reputation yet.)

If you are interested in the total sum over all fields, you could pre-calculate a document-level sum when indexing each document and then, at query time, just take the sum of these values.
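A sketch of the idea (the "total" field name and the values are hypothetical): each document carries its own pre-computed sum, so the query reduces to a single aggregation whose cost no longer grows with the number of fields:

PUT INDEX/TYPE/1
{
    "FIELD 1": 3,
    "FIELD 2": 7,
    "total": 10
}

GET INDEX/TYPE/_search
{
    "size": 0,
    "aggs": {
        "total": {
            "sum": {
                "field": "total"
            }
        }
    }
}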

You could also try storing fields as doc_values and see if it helps. You would have less memory pressure and garbage collection, although the docs mention a possible 10-25% performance hit.
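A minimal mapping sketch (assuming an integer field named "FIELD 1"); note that doc_values must be set when the field is first mapped, it can't be switched on for an already-indexed field:

PUT INDEX/_mapping/TYPE
{
    "properties": {
        "FIELD 1": {
            "type": "integer",
            "doc_values": true
        }
    }
}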

Upvotes: 1
