Reputation: 956

What should I do when a zookeeper node is back to normal?

I have a ZooKeeper cluster with three nodes: zk01, zk02 and zk03. For maintenance, I shut down zk01 and replaced it with a new node, which is still called zk01. However, I got the error message "This ZooKeeper instance is not currently serving requests" when I ran "echo stat | nc zk01 2181". So I ran the same command against zk02 and zk03:

[email protected] ~ # echo stat | nc zk02 2181
Zookeeper version: 3.3.5-cdh3u6--1, built on 03/20/2013 20:28 GMT
Clients:
 /10.18.5.187:36772[0](queued=0,recved=1,sent=0)

Latency min/avg/max: 0/1/67
Received: 23938
Sent: 23937
Outstanding: 0
Zxid: 0x3000f68e2
Mode: follower
Node count: 1453
[email protected] ~ # echo stat | nc zk02 2181
Zookeeper version: 3.3.5-cdh3u6--1, built on 03/20/2013 20:28 GMT
Clients:
 /10.18.5.187:36773[0](queued=0,recved=1,sent=0)

Latency min/avg/max: 0/1/67
Received: 23939
Sent: 23938
Outstanding: 0
Zxid: 0x3000f68e2
Mode: follower
Node count: 1453

[email protected] ~ # echo stat | nc zk03 2181
Zookeeper version: 3.3.5-cdh3u6--1, built on 03/20/2013 20:28 GMT
Clients:
 /10.18.5.224:35190[1](queued=0,recved=19246695,sent=19255810)
 /10.18.5.225:51732[1](queued=0,recved=1902803,sent=1911886)
 /10.18.5.187:44885[0](queued=0,recved=1,sent=0)
 /10.18.8.125:53937[1](queued=0,recved=1529,sent=1532)

Latency min/avg/max: 0/0/105
Received: 21223069
Sent: 21241269
Outstanding: 0
Zxid: 0x3000f68e2
Mode: leader
Node count: 1453

'10.18.5.187' is the IP address of zk01. My question is: is zk01 part of my ZooKeeper cluster now? If so, why does it report that it is not serving requests? If not, what should I do to add it to the cluster?

Upvotes: 3

Answers (2)

Vijay Kumar

Reputation: 2707

The ZooKeeper servers need to be started in the order they are listed in the config file. So shut down all the servers, then start them in this order:

server.1
server.2
server.3

Upvotes: 1

redstonemercury

Reputation: 364

I am having this exact same issue.

I see the new IP listed in the stat output of the other two servers, just like above, but the expected snapshots/transaction logs are missing from the data directories, so I assume the new server has not properly joined the cluster.
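
The data directory check mentioned above can be scripted. This is a minimal sketch, assuming the usual on-disk layout in which ZooKeeper writes files named snapshot.<zxid> and log.<zxid> under a version-2 subdirectory of dataDir. The function name list_zk_data_files is hypothetical, and if dataLogDir is set in zoo.cfg the transaction logs live there instead.

```python
from pathlib import Path
from typing import Dict, List


def list_zk_data_files(data_dir: str) -> Dict[str, List[str]]:
    """Return the snapshot and transaction log file names under data_dir.

    Assumes the layout <data_dir>/version-2/{snapshot.*,log.*}.
    """
    version_dir = Path(data_dir) / "version-2"
    found: Dict[str, List[str]] = {"snapshots": [], "logs": []}
    if not version_dir.is_dir():
        # No version-2 directory at all: the server never wrote any state.
        return found
    for entry in sorted(version_dir.iterdir()):
        if entry.name.startswith("snapshot."):
            found["snapshots"].append(entry.name)
        elif entry.name.startswith("log."):
            found["logs"].append(entry.name)
    return found
```

Calling it with the dataDir value from zoo.cfg on the new node and getting two empty lists back would match the symptom described here.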

Based on https://issues.apache.org/jira/browse/ZOOKEEPER-338 (which is filed against the client, but judging by the details in the ticket apparently affects the server too), it sounds like ZooKeeper does not re-run DNS resolution once it has determined the IP of a host. That is at least the case for versions before 3.5.0; it sounds like 3.5.0 has the opposite issue of re-resolving on every call and slowing ZooKeeper down.

This means that (before 3.5.0) if you redeploy a node with the same hostname but a different IP, the ZooKeeper instances that are already running will not update that host to the new IP.
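
To illustrate the behaviour I am describing, here is a minimal Python sketch of a peer table that resolves each hostname once and never again. It is not ZooKeeper's actual code: PeerTable, its methods and the simulated dns mapping are hypothetical, and the old address 10.18.5.100 is made up for the example.

```python
from typing import Callable, Dict


class PeerTable:
    """Hypothetical peer table that resolves each hostname only once."""

    def __init__(self, resolve: Callable[[str], str]) -> None:
        self._resolve = resolve
        self._cache: Dict[str, str] = {}

    def address_of(self, host: str) -> str:
        # Resolve on first use, then keep returning the cached address.
        if host not in self._cache:
            self._cache[host] = self._resolve(host)
        return self._cache[host]

    def restart(self) -> None:
        # A process restart is modelled as dropping every cached address.
        self._cache.clear()


# Simulated DNS: a mutable mapping standing in for the real resolver.
dns = {"zk01": "10.18.5.100"}      # old address (made up for this example)
peers = PeerTable(lambda host: dns[host])

before = peers.address_of("zk01")  # cached on first lookup
dns["zk01"] = "10.18.5.187"        # zk01 is redeployed with a new IP
stale = peers.address_of("zk01")   # still the old address
peers.restart()
fresh = peers.address_of("zk01")   # new address only after the restart
```

In this model the surviving peers keep talking to the old address until they are restarted, which is consistent with a full restart being one way out.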

With this in mind, the two options I see are:

1. Stop all instances of zookeeper (taking the quorum down) then start it back up and see if the issue is fixed. You would take zookeeper offline for this, so not really a viable option in a production deployment.
2. Don't reuse hostnames; provision zk04 instead of zk01 and update the zoo.cfg and myid files on the new zk04 appropriately.

I have to check whether my zookeeper quorum is in production use yet before attempting the first option (which is my preference since I like my hostnames consistent) but will update this thread with an answer as to whether that fixed the issue or not in the next few days.

Update: Stopping ZooKeeper on all the nodes, then starting them back up one at a time, fixed this issue. If you can afford the downtime, it is an easy fix.
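
After the restart, the same "stat" check used in the question can confirm that every node has rejoined. The following is a rough sketch in Python rather than a definitive tool: the function names are hypothetical, the hosts and port 2181 are taken from the question, and it assumes the server closes the connection after answering a four-letter command (as it does for "echo stat | nc").

```python
import socket
from typing import List, Optional


def parse_mode(stat_output: str) -> Optional[str]:
    """Return the value of the 'Mode:' line, or None if there is none."""
    for line in stat_output.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def four_letter_word(host: str, port: int = 2181, cmd: str = "stat",
                     timeout: float = 5.0) -> str:
    """Send a four-letter command and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection: reply is complete
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def quorum_looks_healthy(modes: List[Optional[str]]) -> bool:
    """True if exactly one node is the leader and all others are followers."""
    return (modes.count("leader") == 1
            and modes.count("follower") == len(modes) - 1)


def check_ensemble(hosts: List[str]) -> bool:
    """Query every host and report whether the ensemble looks healthy."""
    modes = [parse_mode(four_letter_word(h)) for h in hosts]
    return quorum_looks_healthy(modes)
```

Calling check_ensemble(["zk01", "zk02", "zk03"]) should return True once all three nodes are serving. A node that answers "This ZooKeeper instance is not currently serving requests" has no Mode line, so it shows up as None and the check fails.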

Upvotes: 3
