Reputation: 1860
Zookeeper is not properly recovering after timeout and falling into a non working state without restarting.
What can cause this, how can we resolve it?
config
ZOO_TICK_TIME: 2000
ZOO_INIT_LIMIT: 30000
ZOO_SYNC_LIMIT: 10
ZOO_MAX_CLIENT_CNXNS: 2000
ZOO_STANDALONE_ENABLED: 'false'
ZOO_AUTOPURGE_PURGEINTERVAL: 1
ZOO_AUTOPURGE_SNAPRETAINCOUNT: 10
ZOO_LOG4J_PROP: INFO,ROLLINGFILE
log
2020-01-10 20:58:57,473 [myid:1] - INFO [WorkerReceiver[myid=1]:FastLeaderElection@679] - Notification: 2 (message format version), 2 (n.leader), 0x2200061493 (n.zxid), 0x2 (n.round), LOOKING (n.state), 2 (n.sid), 0x22 (n.peerEPoch), FOLLOWING (my state)0 (n.config version)
2020-01-10 20:59:57,484 [myid:1] - INFO [WorkerReceiver[myid=1]:FastLeaderElection@679] - Notification: 2 (message format version), 2 (n.leader), 0x2200061493 (n.zxid), 0x2 (n.round), LOOKING (n.state), 2 (n.sid), 0x22 (n.peerEPoch), FOLLOWING (my state)0 (n.config version)
2020-01-10 21:00:57,495 [myid:1] - INFO [WorkerReceiver[myid=1]:FastLeaderElection@679] - Notification: 2 (message format version), 2 (n.leader), 0x2200061493 (n.zxid), 0x2 (n.round), LOOKING (n.state), 2 (n.sid), 0x22 (n.peerEPoch), FOLLOWING (my state)0 (n.config version)
2020-01-10 21:01:07,510 [myid:1] - WARN [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):Follower@96] - Exception when following the leader
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:85)
at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:158)
at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:92)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1271)
2020-01-10 21:01:07,516 [myid:1] - INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):Follower@201] - shutdown called
java.lang.Exception: shutdown Follower
at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:201)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1275)
2020-01-10 21:01:07,516 [myid:1] - INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):LearnerZooKeeperServer@165] - Shutting down
2020-01-10 21:01:07,517 [myid:1] - INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):ZooKeeperServer@558] - shutting down
2020-01-10 21:01:07,517 [myid:1] - INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):FollowerRequestProcessor@139] - Shutting down
2020-01-10 21:01:07,518 [myid:1] - INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):CommitProcessor@362] - Shutting down
2020-01-10 21:01:07,518 [myid:1] - INFO [FollowerRequestProcessor:1:FollowerRequestProcessor@110] - FollowerRequestProcessor exited loop!
2020-01-10 21:01:07,518 [myid:1] - INFO [CommitProcessor:1:CommitProcessor@195] - CommitProcessor exited loop!
2020-01-10 21:01:07,518 [myid:1] - INFO [CommitProcessor:1:CommitProcessor@195] - CommitProcessor exited loop!
2020-01-10 21:01:07,526 [myid:1] - INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):FinalRequestProcessor@514] - shutdown of request processor complete
2020-01-10 21:01:07,659 [myid:1] - INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):SyncRequestProcessor@191] - Shutting down
2020-01-10 21:01:07,660 [myid:1] - INFO [SyncThread:1:SyncRequestProcessor@169] - SyncRequestProcessor exited!
2020-01-10 21:01:07,661 [myid:1] - WARN [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):QuorumPeer@1318] - PeerState set to LOOKING
2020-01-10 21:01:07,661 [myid:1] - INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):QuorumPeer@1193] - LOOKING
2020-01-10 21:01:07,664 [myid:1] - INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):FastLeaderElection@885] - New election. My id = 1, proposed zxid=0x23007a7d69
2020-01-10 21:01:07,666 [myid:1] - INFO [WorkerReceiver[myid=1]:FastLeaderElection@679] - Notification: 2 (message format version), 1 (n.leader), 0x23007a7d69 (n.zxid), 0x2 (n.round), LOOKING (n.state), 1 (n.sid), 0x23 (n.peerEPoch), LOOKING (my state)0 (n.config version)
2020-01-10 21:01:07,668 [myid:1] - INFO [WorkerReceiver[myid=1]:FastLeaderElection@679] - Notification: 2 (message format version), 3 (n.leader), 0x2200061493 (n.zxid), 0x1 (n.round), LEADING (n.state), 3 (n.sid), 0x23 (n.peerEPoch), LOOKING (my state)0 (n.config version)
2020-01-10 21:01:07,668 [myid:1] - INFO [WorkerReceiver[myid=1]:FastLeaderElection@679] - Notification: 2 (message format version), 1 (n.leader), 0x23007a7d69 (n.zxid), 0x2 (n.round), LOOKING (n.state), 2 (n.sid), 0x23 (n.peerEPoch), LOOKING (my state)0 (n.config version)
2020-01-10 21:01:07,870 [myid:1] - INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):QuorumPeer@1281] - LEADING
2020-01-10 21:01:07,885 [myid:1] - INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):Leader@66] - TCP NoDelay set to: true
2020-01-10 21:01:07,885 [myid:1] - INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):Leader@86] - zookeeper.leader.maxConcurrentSnapshots = 10
2020-01-10 21:01:07,885 [myid:1] - INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):Leader@88] - zookeeper.leader.maxConcurrentSnapshotTimeout = 5
2020-01-10 21:01:07,890 [myid:1] - INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):ZooKeeperServer@938] - minSessionTimeout set to 4000
2020-01-10 21:01:07,890 [myid:1] - INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):ZooKeeperServer@947] - maxSessionTimeout set to 40000
2020-01-10 21:01:07,890 [myid:1] - INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):ZooKeeperServer@166] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /datalog/version-2 snapdir /data/version-2
2020-01-10 21:01:07,900 [myid:1] - INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):Leader@464] - LEADING - LEADER ELECTION TOOK - 29 MS
2020-01-10 21:01:07,918 [myid:1] - INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):FileTxnSnapLog@384] - Snapshotting: 0x23007a7d69 to /data/version-2/snapshot.23007a7d69
2020-01-10 21:01:07,964 [myid:1] - WARN [NIOWorkerThread-1:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
2020-01-10 21:01:08,040 [myid:1] - WARN [NIOWorkerThread-2:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
2020-01-10 21:01:08,318 [myid:1] - WARN [NIOWorkerThread-3:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
2020-01-10 21:01:08,491 [myid:1] - WARN [NIOWorkerThread-1:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
2020-01-10 21:01:08,491 [myid:1] - WARN [NIOWorkerThread-4:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
2020-01-10 21:01:08,501 [myid:1] - WARN [NIOWorkerThread-2:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
2020-01-10 21:01:08,501 [myid:1] - WARN [NIOWorkerThread-3:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
2020-01-10 21:01:08,509 [myid:1] - WARN [NIOWorkerThread-1:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
2020-01-10 21:01:08,510 [myid:1] - WARN [NIOWorkerThread-4:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
2020-01-10 21:01:08,519 [myid:1] - INFO [WorkerReceiver[myid=1]:FastLeaderElection@679] - Notification: 2 (message format version), 3 (n.leader), 0x23007a7d72 (n.zxid), 0x2 (n.round), LOOKING (n.state), 3 (n.sid), 0x23 (n.peerEPoch), LEADING (my state)0 (n.config version)
2020-01-10 21:01:08,524 [myid:1] - WARN [NIOWorkerThread-2:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
2020-01-10 21:01:08,524 [myid:1] - WARN [NIOWorkerThread-3:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
2020-01-10 21:01:08,535 [myid:1] - WARN [NIOWorkerThread-1:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
2020-01-10 21:01:08,535 [myid:1] - WARN [NIOWorkerThread-4:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
2020-01-10 21:01:08,545 [myid:1] - WARN [NIOWorkerThread-2:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
2020-01-10 21:01:08,545 [myid:1] - WARN [NIOWorkerThread-3:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
2020-01-10 21:01:08,572 [myid:1] - WARN [NIOWorkerThread-1:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
2020-01-10 21:01:08,574 [myid:1] - WARN [NIOWorkerThread-4:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
2020-01-10 21:01:08,582 [myid:1] - WARN [NIOWorkerThread-2:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
2020-01-10 21:01:08,583 [myid:1] - WARN [NIOWorkerThread-3:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
2020-01-10 21:01:08,594 [myid:1] - WARN [NIOWorkerThread-1:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
2020-01-10 21:01:08,594 [myid:1] - WARN [NIOWorkerThread-1:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
2020-01-10 21:01:08,595 [myid:1] - WARN [NIOWorkerThread-1:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
...
Upvotes: 1
Views: 4875
Reputation: 158
This
2020-01-10 21:01:07,900 [myid:1] - INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):Leader@464] - LEADING - LEADER ELECTION TOOK - 29 MS
2020-01-10 21:01:07,918 [myid:1] - INFO [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):FileTxnSnapLog@384] - Snapshotting: 0x23007a7d69 to /data/version-2/snapshot.23007a7d69
is telling me, the zk node became leader and it started writing snapshot (after that, it is probably going to send snapshot to followers). Then node will start serving queries. But if snapshot is big (or /data/version-2/ underlying hardware is slow) writing will take significant amount of time. If it takes longer than maxSessionTimeout 40000
, all active sessions and all ephemeral data will be lost.
If that is the case, then the only good way to avoid it is to reduce amount of data stored in zookeeper. For example, store heavy blobs somewhere else (file/nfs/hdfs/aws/sql/http/...), and in zk only write timestamps (and path in storage maybe), so that the application still could receive data-change-notications quickly from zk.
Upvotes: 0
Reputation: 1523
For the typical symptomatic categories listed above, use the lists below to check if you are seeing frequent disconnects of healthy clients
Network connectivity problems:
I/O starvation
Reference link : https://mapr.com/support/s/article/How-do-I-troubleshoot-ZK-connection-timeout-issues?language=en_US
Upvotes: 2