simon

Reputation: 273

How does elasticsearch prevent cascading failure after node outage due to disk pressure?

We operate an elasticsearch stack that uses 3 nodes to store log data. Our current config is indices with 3 primaries and 1 replica. (We have just eyeballed this config and are happy with the performance, so we decided not to spend time on optimization yet.)

After a node outage (let's assume a full disk), I have observed that elasticsearch automatically redistributes its shards to the remaining instances - as advertised.

However, this increases disk usage on the remaining two instances, making them candidates for cascading failure.

Durability of the log data is not paramount. I am therefore thinking about reconfiguring elasticsearch to not create new replicas after a node outage. Instead, it would just run on the primaries. This means that after a single node outage, we would run without redundancy, but that seems better than a cascading failure. (This is a one-time cost.)

An alternative would be to just increase disk size. (This is an ongoing cost)

My question

(How) can I configure elasticsearch to not create new replicas after the first node has failed? Or is this considered a bad idea and the canonical way is to just increase disk capacity?

Upvotes: 1

Views: 130

Answers (1)

simon
simon

Reputation: 273

Rebalancing is expensive

When a node leaves the cluster, some additional load is generated on the remaining nodes:

  • Promoting a replica shard to primary to replace any primaries that were on the node.
  • Allocating replica shards to replace the missing replicas (assuming there are enough nodes).
  • Rebalancing shards evenly across the remaining nodes.

This can result in a significant amount of data being moved around.
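To see what the cluster is doing with a particular shard, you can query the cluster allocation explain API. A minimal sketch, assuming the cluster is reachable on localhost:9200:

```shell
# Ask elasticsearch why a shard is unassigned or allocated where it is.
# With an empty request, it explains the first unassigned shard it finds.
curl -X GET "localhost:9200/_cluster/allocation/explain?pretty"
```

The response includes the reason the shard left its node (e.g. NODE_LEFT) and, for delayed shards, how long the cluster will wait before reallocating them.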

Sometimes a node is only missing for a short period of time, and a full rebalance is not justified. To account for that, when a node goes down, elasticsearch immediately promotes a replica shard to primary for each primary that was on the missing node, but then waits for one minute before creating new replicas, to avoid unnecessary copying.

Only rebalance when required

The duration of this delay is a tradeoff and can therefore be configured. Waiting longer reduces the chance of unnecessary copying, but lengthens the window of reduced redundancy.

Increasing the delay to a few hours achieves what I am looking for: it gives our engineers some time to react before the additional rebalancing load can trigger a cascading failure.
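The setting in question is `index.unassigned.node_left.delayed_timeout` (default `1m`). A sketch of raising it for all existing indices, assuming the cluster is reachable on localhost:9200 and that 5 hours is an acceptable window for your team (both are assumptions, adjust to your setup):

```shell
# Delay replica reallocation after a node leaves the cluster.
# Applies to all current indices; new indices need it set too
# (e.g. via an index template).
curl -X PUT "localhost:9200/_all/_settings" \
  -H 'Content-Type: application/json' \
  -d '{
    "settings": {
      "index.unassigned.node_left.delayed_timeout": "5h"
    }
  }'
```

Note that this only delays the creation of new replicas; replica promotion to primary still happens immediately, so the cluster stays writable during the delay.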

I learned that from the official elasticsearch documentation.

Upvotes: 1
