Reputation: 795
I create a Curator client as follows:
RetryPolicy retryPolicy = new RetryNTimes(3, 1000);
CuratorFramework client = CuratorFrameworkFactory.newClient(zkConnectString,
        15000, // sessionTimeoutMs
        15000, // connectionTimeoutMs
        retryPolicy);
When running my client program I simulate a network partition by bringing down the NIC that Curator is using to communicate with Zookeeper. I have a few questions based on the behavior that I am seeing:
1. I see a "ConnectionStateManager - State change: SUSPENDED" message after 10 seconds. Is the amount of time until Curator enters the SUSPENDED state configurable, based on a percentage of the other timeout values, or always 10 seconds?
2. I see a "ZooKeeper - Session: 0x14adf3f01ef0001 closed" message in the log; however, this does not appear to trickle up as an event that I can capture or listen on. Am I missing something here?
3. I see a "ConnectionStateManager - State change: LOST" message almost two minutes after the connection loss. Why so long?
4. [...] SUSPENDED message is received, since it is entirely possible that Zookeeper has released the lock unbeknownst to it on the other side of the network partition. Is this a typical/sane approach?
Upvotes: 5
Views: 5309
Reputation: 1
Regarding the first question: ZooKeeper has a constant called MAX_SEND_PING_INTERVAL, which is 10 seconds, so in your case it is always 10 seconds. The code is in the ClientCnxn class:
// 1000 (1 second) is to prevent a race condition missing to send the second ping
// also make sure not to send too many pings when readTimeout is small
int timeToNextPing = readTimeout / 2 - clientCnxnSocket.getIdleSend() -
        ((clientCnxnSocket.getIdleSend() > 1000) ? 1000 : 0);
// send a ping request either when time is due or no packet was sent out within MAX_SEND_PING_INTERVAL
if (timeToNextPing <= 0 || clientCnxnSocket.getIdleSend() > MAX_SEND_PING_INTERVAL) {
    sendPing();
    clientCnxnSocket.updateLastSend();
} else {
    if (timeToNextPing < to) {
        to = timeToNextPing;
    }
}
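To make the arithmetic in that snippet easier to follow, here is a minimal standalone sketch that evaluates the same expression for a few idle times. It does not use any ZooKeeper classes: the class name PingTimingSketch, the helper shouldSendPing, and the readTimeoutMs value of 10000 are assumptions chosen for illustration, and idleSendMs stands in for clientCnxnSocket.getIdleSend().

```java
// Standalone sketch of the ping-scheduling arithmetic quoted above.
// Not ZooKeeper code: names and the sample readTimeoutMs are hypothetical.
final class PingTimingSketch {
    // Value stated in the answer: 10 seconds.
    static final int MAX_SEND_PING_INTERVAL = 10_000;

    /** Milliseconds until the next ping is due; zero or negative means "send now". */
    static int timeToNextPing(int readTimeoutMs, int idleSendMs) {
        // The extra 1000 ms is subtracted only once the connection
        // has been idle for more than one second.
        return readTimeoutMs / 2 - idleSendMs - (idleSendMs > 1000 ? 1000 : 0);
    }

    /** Same condition as the if statement in the quoted snippet. */
    static boolean shouldSendPing(int readTimeoutMs, int idleSendMs) {
        return timeToNextPing(readTimeoutMs, idleSendMs) <= 0
                || idleSendMs > MAX_SEND_PING_INTERVAL;
    }

    public static void main(String[] args) {
        int readTimeoutMs = 10_000; // hypothetical value, for illustration only
        for (int idleSendMs = 0; idleSendMs <= 5_000; idleSendMs += 1_000) {
            System.out.printf("idle=%d ms, timeToNextPing=%d ms, ping=%b%n",
                    idleSendMs,
                    timeToNextPing(readTimeoutMs, idleSendMs),
                    shouldSendPing(readTimeoutMs, idleSendMs));
        }
    }
}
```

With these sample inputs the first ping becomes due once the connection has been idle for 4000 ms. The MAX_SEND_PING_INTERVAL branch only matters when readTimeout is large enough that half of it exceeds 10 seconds.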
Upvotes: 0
Reputation: 2956
It depends which version of Curator you're using (note: I'm the main author of Curator)...
In Curator 2.x, the LOST state means that a retry policy has been exhausted. It does not mean that the Session has been lost. In ZooKeeper the session is only determined to be lost once the connection to the ensemble is repaired. So, you get SUSPENDED when Curator sees the first "Disconnected" message. Then, when an operation fails due to the retry policy giving up you get LOST.
In Curator 3.x the meaning of LOST was changed. In 3.x, when the "Disconnected" message is received, Curator starts an internal timer. When the timer passes the negotiated session timeout, Curator calls getTestable().injectSessionExpiration() and posts a LOST state change.
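The 3.x behaviour described above can be pictured with a small standalone model. This is only a sketch of the described timer logic, not Curator's implementation: the class LostTimerModel, its methods, and the explicit nowMs arguments are all hypothetical, and the 15000 ms timeout is simply the session timeout from the question, assumed here to be the negotiated value.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the 3.x semantics described above. Not Curator code:
// every name here is hypothetical and time is passed in explicitly.
final class LostTimerModel {
    enum State { CONNECTED, SUSPENDED, LOST, RECONNECTED }

    private final long sessionTimeoutMs;
    private final List<State> posted = new ArrayList<>();
    private long disconnectedAtMs;
    private State current = State.CONNECTED;

    LostTimerModel(long sessionTimeoutMs) {
        this.sessionTimeoutMs = sessionTimeoutMs;
    }

    /** "Disconnected" seen: post SUSPENDED and start the timer. */
    void onDisconnected(long nowMs) {
        if (current == State.CONNECTED || current == State.RECONNECTED) {
            disconnectedAtMs = nowMs;
            post(State.SUSPENDED);
        }
    }

    /** Periodic check: post LOST once the session timeout has elapsed. */
    void onTick(long nowMs) {
        if (current == State.SUSPENDED && nowMs - disconnectedAtMs >= sessionTimeoutMs) {
            post(State.LOST);
        }
    }

    /** Connection repaired after a SUSPENDED or LOST state. */
    void onReconnected() {
        if (current == State.SUSPENDED || current == State.LOST) {
            post(State.RECONNECTED);
        }
    }

    List<State> posted() {
        return posted;
    }

    private void post(State state) {
        current = state;
        posted.add(state);
    }

    public static void main(String[] args) {
        LostTimerModel model = new LostTimerModel(15_000);
        model.onDisconnected(0);   // SUSPENDED is posted immediately
        model.onTick(10_000);      // still within the session timeout
        model.onTick(15_000);      // timeout reached, LOST is posted
        System.out.println(model.posted()); // prints [SUSPENDED, LOST]
    }
}
```

In this model LOST depends only on elapsed time since the disconnect, whereas under the 2.x semantics described earlier it would depend on an operation failing after the retry policy gives up.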
Upvotes: 2
Reputation: 247
Correct. Assume leadership has been lost on both SUSPENDED and LOST. This is the way the Apache Curator recipes work. You may want to use the Apache Curator recipes rather than implementing your own algorithm: https://curator.apache.org/curator-recipes/index.html
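If you do roll your own, the rule above could look roughly like the following standalone sketch. It deliberately avoids the Curator API so that it runs on its own: LeadershipGuardSketch, its State enum, and its method names are hypothetical stand-ins for whatever your connection state listener receives.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the "assume leadership is gone on SUSPENDED or LOST" rule.
// Not Curator code: the enum and all method names are hypothetical.
final class LeadershipGuardSketch {
    enum State { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    private final AtomicBoolean leader = new AtomicBoolean(false);

    /** Call after the lock or leadership has actually been acquired. */
    void acquired() {
        leader.set(true);
    }

    /** Call from the connection state callback. */
    void stateChanged(State newState) {
        if (newState == State.SUSPENDED || newState == State.LOST) {
            // The other side of the partition may already have released the lock.
            leader.set(false);
        }
        // RECONNECTED deliberately does not restore leadership:
        // it has to be acquired again.
    }

    /** Leader-only work should check this before each step. */
    boolean isLeader() {
        return leader.get();
    }

    public static void main(String[] args) {
        LeadershipGuardSketch guard = new LeadershipGuardSketch();
        guard.acquired();
        guard.stateChanged(State.SUSPENDED);
        guard.stateChanged(State.RECONNECTED);
        System.out.println("still leader? " + guard.isLeader()); // prints still leader? false
    }
}
```

With real Curator, stateChanged would be driven by a ConnectionStateListener registered via client.getConnectionStateListenable().addListener(...), which is one more reason to prefer the recipes over a hand-written version.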
Upvotes: 0