DevOps_101

Reputation: 269

Replacing Bad Node in Zookeeper Quorum Safely

We have a 5-node ZooKeeper quorum (A, B, C, D, E) running in production. One node (E) went down last week. The quorum is still healthy, but we need to replace E with a new healthy node (F).

I am weighing two options:

1. Add F to the quorum and then remove E.
2. Replace E with F in the configuration, restart the followers, and then restart the leader.

I tested option #2, and I can see that F is accepted into the quorum once a leader election is forced (by restarting the leader).

The quorum is healthy, but I want to make sure this is the standard procedure.

I can't find any Apache documentation about node replacement for this version.

ZK version: 3.4.6

Upvotes: 5

Views: 3591

Answers (2)

Eno Thereska

Reputation: 244

If, in your example, node F can be brought up with the same IP* and ID (stored in the myid file in the ZooKeeper data directory) as the failed node E, then no further action is needed. The new node F will initially have no data, but it will receive the latest data from the other available nodes. I have verified this with ZooKeeper version 3.4.10.

*This scenario is possible, for example, on AWS, where you can reserve IP addresses for ZooKeeper nodes through ENIs, so a new node F can be given the same IP address as the failed node E.
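To check that F has actually rejoined after it comes up, one option is to send it the `srvr` four-letter command and read the `Mode:` line of the reply. Below is a minimal sketch, assuming `srvr` is enabled on your servers and the client port is 2181; the host name `zk-f.example.internal` is hypothetical.

```python
import socket


def four_letter_word(host, port=2181, cmd=b"srvr", timeout=5.0):
    """Send a ZooKeeper four-letter command and return the text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(reply):
    """Return the value of the 'Mode:' line (e.g. leader, follower), or None."""
    for line in reply.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


# Example usage (hypothetical host name, assumed client port):
#   reply = four_letter_word("zk-f.example.internal", 2181)
#   print(parse_mode(reply))
```

If `parse_mode` returns `None`, the node is most likely not serving requests yet, so it is worth retrying after a short wait before concluding that something is wrong.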

Upvotes: 1

Sachin Lala

Reputation: 735

Yes. For versions prior to 3.5.*, reconfiguring a ZK cluster requires coordinated restarts: first update the configuration to replace the old node with the new one, then restart the servers so that the new node(s) can join the quorum and the old one is removed. I found this gist helpful.
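The two steps above can be sketched roughly as follows. This is only an illustration, assuming F reuses E's server ID, that the ports stay the same, and that the followers are restarted before the leader (as described in the question). The function names and host names are hypothetical.

```python
def replace_server(cfg_text, server_id, new_host):
    """Return zoo.cfg text with the host of server.<server_id> swapped for new_host."""
    prefix = "server.%d=" % server_id
    out = []
    for line in cfg_text.splitlines():
        if line.strip().startswith(prefix):
            # The value looks like host:peerPort:electionPort; keep the ports as-is.
            _, value = line.split("=", 1)
            _old_host, ports = value.split(":", 1)
            line = prefix + new_host + ":" + ports
        out.append(line)
    return "\n".join(out)


def restart_order(modes):
    """Given {host: mode}, return hosts with followers first and the leader last."""
    followers = sorted(h for h, m in modes.items() if m == "follower")
    leaders = sorted(h for h, m in modes.items() if m == "leader")
    return followers + leaders


# Hypothetical example: E (server.5) is replaced by F.
cfg = "tickTime=2000\nserver.4=zk-d:2888:3888\nserver.5=zk-e:2888:3888"
new_cfg = replace_server(cfg, 5, "zk-f")
order = restart_order({"zk-a": "follower", "zk-b": "leader", "zk-c": "follower"})
```

The updated file needs to be in place on every server, including F, before the restarts begin. Restarting the leader last means only one leader election is forced.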

In general, rolling restarts are also the recommended approach for upgrades (reference: Apache link).

If possible, I suggest you consider upgrading to 3.5.*, where dynamic reconfiguration is possible without any restarts.
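For reference, on 3.5.* the swap can be expressed as a single `reconfig` command in `zkCli.sh`. The helper below only builds that command string so you can review it before running it. It is a sketch, assuming the default peer, election, and client ports, and assuming F joins under a new server ID (6). The host name is hypothetical, and dynamic reconfiguration may need to be enabled on the servers first.

```python
def reconfig_command(leaving_id, joining_id, host,
                     peer_port=2888, election_port=3888, client_port=2181):
    """Build a zkCli 'reconfig' command that removes one server and adds another."""
    joining = "server.%d=%s:%d:%d:participant;%d" % (
        joining_id, host, peer_port, election_port, client_port)
    return "reconfig -remove %d -add %s" % (leaving_id, joining)


# Hypothetical example: remove E (id 5), add F (id 6).
command = reconfig_command(5, 6, "zk-f")
print(command)
```

F has to be started with a configuration that lists the current ensemble before the command is issued, so that it can connect and sync.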

Upvotes: 5
