Reputation: 305
I am using Crate 1.0.2, which internally uses Elasticsearch, so my question applies to both. For certain queries I get a circuit breaker exception:
CircuitBreakingException: [parent] Data too large, data for [collect: 0] would be larger than limit of [11946544332/11.1gb]
These queries are mainly GROUP BY on multiple columns. I have billions of documents indexed and 16 GB of RAM allocated as the Crate heap size. I have multiple such nodes connected together in a cluster. Will adding more nodes to the cluster get rid of this error, so that the same queries run fine? Or must I increase the heap to 30 GB? My worry is that even if I increase it to 30 GB, as I add more data the query will someday hit the circuit breaker again. So I wanted to solve it by scaling horizontally, i.e. adding more nodes. Would that be the wiser decision?
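For illustration, the failing queries look roughly like this (hypothetical table and column names, not my actual schema):

```sql
-- Hypothetical shape of the queries that hit the circuit breaker:
SELECT customer_id, region, count(*) AS cnt
FROM events
GROUP BY customer_id, region;
```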
Upvotes: 0
Views: 271
Reputation: 3971
Short answer: Usually horizontal scaling helps.
Your error seems to be caused by GROUP BY queries. GROUP BY operations are executed in a distributed fashion, so more nodes will generally split the load and therefore also the memory usage. (Make sure there are enough shards so that they're spread across all nodes.)
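To verify that the shards of the queried table really are spread across all nodes, you can check the sys.shards table. A minimal sketch, assuming a table named events (substitute your own table name):

```sql
-- Shards per node for one table; an even spread means the
-- GROUP BY work is distributed evenly as well:
SELECT node['name'] AS node_name, count(*) AS num_shards
FROM sys.shards
WHERE table_name = 'events'
GROUP BY node['name'];
```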
There is a catch, though: eventually the data needs to be merged together on the node that received the initial query. This is generally fine because the data arrives pre-aggregated, but if the cardinality is too high (e.g. GROUP BY on a primary key), the whole data set has to fit into memory on this coordinator node.
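To illustrate the cardinality point (again with a hypothetical events table):

```sql
-- Low cardinality: few groups, so the pre-aggregated partial results
-- merged on the coordinator stay small:
SELECT region, count(*) AS cnt FROM events GROUP BY region;

-- Very high cardinality: grouping by a (nearly) unique key forces the
-- coordinator to hold one group per row, which can trip the breaker:
SELECT event_id, count(*) AS cnt FROM events GROUP BY event_id;
```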
If your nodes have enough memory to go up to 30 GB (while still leaving enough to spare for the file system cache), I'd personally tend to increase the heap size first, before adding new nodes.
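The heap is configured via the CRATE_HEAP_SIZE environment variable. To see how much headroom you currently have, you can query sys.nodes; a minimal sketch:

```sql
-- Current heap usage vs. configured maximum, per node:
SELECT name,
       heap['used'] / 1024.0 / 1024.0 AS heap_used_mb,
       heap['max'] / 1024.0 / 1024.0 AS heap_max_mb
FROM sys.nodes;
```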
Update: Recent versions (2.1.x) also contain some fixes regarding the circuit breaker behaviour, so if it's possible to update, that'd be recommended as well.
Update 2:
Note that there are different cases in which a circuit breaker can trip. In your case it's caused by a GROUP BY using up quite a lot of memory. But it can also trip if a single request is too large, for example if the bulk size of an insert is too large. In such a case more nodes wouldn't help; you'd have to reduce the bulk size, as in the sketch below.
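As a sketch of what reducing the bulk size means (hypothetical table and values): instead of sending one huge multi-row INSERT, split the rows into several smaller statements:

```sql
-- Each statement carries a small batch, so no single request
-- exceeds the circuit breaker limit:
INSERT INTO events (event_id, region, cnt) VALUES
    (1, 'eu', 42),
    (2, 'us', 17);
-- ...next batch goes into a separate statement...
```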
Upvotes: 1