simon

Reputation: 273

How does elasticsearch prevent cascading failure after node outage due to disk pressure?

We operate an elasticsearch stack that uses 3 nodes to store log data. Our current config is indices with 3 primaries and 1 replica. (We have just eyeballed this config and are happy with the performance, so we decided not to spend time on optimization yet.)

After a node outage (let's assume a full disk), I have observed that elasticsearch automatically redistributes its shards to the remaining instances - as advertised.

However, this increases disk usage on the remaining two instances, making them candidates for cascading failure.

Durability of the log data is not paramount. I am therefore thinking about reconfiguring elasticsearch to not create new replicas after a node outage. Instead, it would just run on the primaries. This means that after a single node outage, we would run without redundancy, but that seems better than a cascading failure. (This is a one-time cost.)

An alternative would be to just increase disk size. (This is an ongoing cost)

My question

(How) can I configure elasticsearch to not create new replicas after the first node has failed? Or is this considered a bad idea and the canonical way is to just increase disk capacity?

Upvotes: 1

Views: 130

Answers (1)

simon
simon

Reputation: 273

Rebalancing is expensive

When a node leaves the cluster, some additional load is generated on the remaining nodes:

  • Promoting a replica shard to primary to replace any primaries that were on the node.
  • Allocating replica shards to replace the missing replicas (assuming there are enough nodes).
  • Rebalancing shards evenly across the remaining nodes.

This can result in a significant amount of data being moved around.
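To see what the cluster is doing with a particular shard, you can query the cluster allocation explain API. A minimal sketch, assuming the cluster is reachable on localhost:9200:

```shell
# Ask elasticsearch why a shard is unassigned or allocated where it is.
# With an empty request, it explains the first unassigned shard it finds.
curl -X GET "localhost:9200/_cluster/allocation/explain?pretty"
```

The response includes the reason the shard left its node (e.g. NODE_LEFT) and, for delayed shards, how long the cluster will wait before reallocating them.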

Sometimes a node is only missing for a short period of time, and a full rebalance is not justified. To account for that, when a node goes down, elasticsearch immediately promotes a replica shard to primary for each primary that was on the missing node, but then waits for one minute before creating new replicas, to avoid unnecessary copying.

Only rebalance when required

The duration of this delay is a tradeoff and can therefore be configured. Waiting longer reduces the chance of unnecessary copying, but lengthens the window of reduced redundancy.

Increasing the delay to a few hours achieves what I am looking for: it gives our engineers some time to react before the additional rebalancing load can trigger a cascading failure.
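The setting in question is `index.unassigned.node_left.delayed_timeout` (default `1m`). A sketch of raising it for all existing indices, assuming the cluster is reachable on localhost:9200 and that 5 hours is an acceptable window for your team (both are assumptions, adjust to your setup):

```shell
# Delay replica reallocation after a node leaves the cluster.
# Applies to all current indices; new indices need it set too
# (e.g. via an index template).
curl -X PUT "localhost:9200/_all/_settings" \
  -H 'Content-Type: application/json' \
  -d '{
    "settings": {
      "index.unassigned.node_left.delayed_timeout": "5h"
    }
  }'
```

Note that this only delays the creation of new replicas; replica promotion to primary still happens immediately, so the cluster stays writable during the delay.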

I learned that from the official elasticsearch documentation.

Upvotes: 1
