Reputation: 741
We need to restart all the nodes in our Elasticsearch cluster so they can be patched. We need zero downtime, so we can't stop all the data nodes together (we have 1 primary and 1 replica per shard, with 150 indexes spread across 20 data nodes and 3 masters, running ES 2.4.4).
The standard approach is to restart each node individually, wait until it's back up and the cluster has rebalanced, then repeat the process for every node: https://www.elastic.co/guide/en/elasticsearch/guide/2.x/_rolling_restarts.html
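For context, the per-node procedure from that guide boils down to something like this (a rough, untested Python sketch; ES_URL is a placeholder for our cluster address and restart_node() stands in for however the node actually gets patched and rebooted):

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # placeholder for our cluster address

def put_cluster_settings(settings):
    """PUT transient cluster settings."""
    req = urllib.request.Request(
        ES_URL + "/_cluster/settings",
        data=json.dumps({"transient": settings}).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    urllib.request.urlopen(req)

def restart_one_node(restart_node):
    # 1. stop the cluster reallocating shards while the node is away
    put_cluster_settings({"cluster.routing.allocation.enable": "none"})
    # 2. patch and reboot the node (placeholder)
    restart_node()
    # 3. allow shard allocation again
    put_cluster_settings({"cluster.routing.allocation.enable": "all"})
    # 4. block until the cluster is green before moving on to the next node
    urllib.request.urlopen(
        ES_URL + "/_cluster/health?wait_for_status=green&timeout=30m")
```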
This is going to take too long, as we have around 80 shards per node and it takes a while to get them reallocated. Are there any tools that can identify how we could reboot multiple data nodes simultaneously (i.e. identify groups of nodes such that no shard has both its primary and its replica within that group)?
Or is there any other approach to achieving the same?
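To make the first question concrete, this is roughly the kind of tool I'm imagining (a rough, untested Python sketch; ES_URL is a placeholder, it parses the plain-text _cat/shards output, and it assumes a stable cluster with no relocating shards). It groups data nodes so that no two nodes in the same group hold both copies of any shard, so each group could in principle be rebooted together:

```python
import itertools
import urllib.request

ES_URL = "http://localhost:9200"  # placeholder for our cluster address

def shard_locations():
    """Return {(index, shard_number): set_of_nodes} for started shard copies."""
    # node is requested as the last column so node names containing spaces stay intact
    url = ES_URL + "/_cat/shards?h=index,shard,state,node"
    body = urllib.request.urlopen(url).read().decode()
    locations = {}
    for line in body.splitlines():
        parts = line.split(None, 3)
        if len(parts) < 4 or parts[2] != "STARTED":
            continue  # skip unassigned or relocating copies
        index, shard, _, node = parts
        locations.setdefault((index, shard), set()).add(node)
    return locations

def restart_groups(locations):
    """Greedy grouping: nodes sharing any primary/replica pair never share a group."""
    all_nodes = set().union(*locations.values())
    conflicts = {node: set() for node in all_nodes}
    for nodes in locations.values():
        for a, b in itertools.combinations(sorted(nodes), 2):
            conflicts[a].add(b)
            conflicts[b].add(a)
    groups = []
    for node in sorted(all_nodes):
        for group in groups:
            if not conflicts[node] & group:
                group.add(node)
                break
        else:
            groups.append({node})
    return groups

if __name__ == "__main__":
    for i, group in enumerate(restart_groups(shard_locations()), 1):
        print("Group %d: %s" % (i, ", ".join(sorted(group))))
```

The grouping is just a greedy colouring of the "shares a shard" graph, so it won't necessarily find the largest possible groups.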
Upvotes: 1
Views: 1699
Reputation: 11013
tl;dr: Since you have only one replica per shard, you most likely can't take more than one node down at a time without going red (for at least one of your indices).
This is an exercise in combinatorics. When you take one node down, you need the rest of the nodes to serve the 80 shards hosted by that node.
Say you want to take another one of the 19 live nodes down and still not go to red status for any of your indices. This is possible only if that node doesn't host the surviving copy of any of those 80 shards. Let's compute this probability.
Probability that a given live node doesn't host the copy of one particular shard (assuming the copies are spread roughly uniformly over the 19 live nodes) = 18/19
Probability that a given live node hosts none of the 80 copies = (18/19)^80 ≈ 0.013 = 1.3%
So if you take a second node down at random, the probability that you go red (for at least one index) is about 98.7%.
Considering all 19 live nodes in aggregate, you have at most roughly a 1.3% * 19 ≈ 25% chance of finding any second node you can take down without going red. (I am not entirely sure about this last figure because of the independence assumptions involved, but I believe it conveys the idea.)
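For the record, the arithmetic above in Python:

```python
p_safe = (18.0 / 19.0) ** 80   # a given live node hosts none of the 80 surviving copies
print(round(p_safe, 3))        # ~0.013 -> 1.3%
print(round(1 - p_safe, 3))    # ~0.987 -> 98.7% chance of going red
print(round(19 * p_safe, 2))   # ~0.25  -> rough chance that some second node is safe
```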
Upvotes: 1