Reputation: 539
We recently began having sporadic issues at a client site where JSch began reporting "channel is not opened" during the call to sendChannelOpen after running successfully for years. After digging into the code and seeing the two options (timeout vs. failure message) it appeared most likely we were getting a failure.
Since the messages were the same, I downloaded the source and added some output statements to verify. After a few tests I was surprised to find that we were actually hitting the timeout section, but so quickly that it didn't make sense when reviewing the code and taking the wait() statement into account.
To further troubleshoot, I put a quick standalone app together to test using the following method as the core call:
public static void testConnectionJCraft(String host, String user, String password, int port, int timeout) throws Exception {
JSch jsch = new JSch();
Session session = jsch.getSession(user, host, port);
try {
session.setPassword(password);
session.setConfig("StrictHostKeyChecking", "no");
session.connect();
if (session.isConnected()) {
System.out.println("JCraft Connected sucessfully to server : " + host);
} else {
throw new Exception("JCraft Connection Failed: " + host);
}
Channel channel = session.openChannel("sftp");
try {
if (timeout > 0) {
channel.connect(timeout);
} else {
channel.connect();
}
} finally {
channel.disconnect();
}
} finally {
if (session != null) {
session.disconnect();
}
}
}
Note - I know the if (timeout > 0) is extraneous and I could just call channel.connect(timeout) and get the same result when timeout == 0, but I wanted to make sure the call mimicked the call within the app I am troubleshooting just in case.
I then added some additional statements to the sendChannelOpen method in com.jcraft.jsch.Channel:
protected void sendChannelOpen() throws Exception {
Session _session=getSession();
if(!_session.isConnected()){
throw new JSchException("session is down");
}
Packet packet = genChannelOpenPacket();
_session.write(packet);
int retry=2000;
long start=System.currentTimeMillis();
long timeout=connectTimeout;
long iteration = 0;
if(timeout!=0L) retry = 1;
System.out.println("Timeout: " + timeout);
long t = timeout== 0L ? 10L : timeout;
System.out.println("t: " + t);
synchronized(this){
while(this.getRecipient()==-1 &&
_session.isConnected() &&
retry>0){
if(timeout>0L){
long dur = (System.currentTimeMillis()-start);
if(dur>timeout){
System.out.println("Dur: " + dur + " > " + timeout);
retry=0;
continue;
}
}
try{
this.notifyme=1;
wait(t);
}
catch(java.lang.InterruptedException e){
System.out.println("Interrupted?");
}
finally{
this.notifyme=0;
}
retry--;
iteration++;
}
}
long end = System.currentTimeMillis();
System.out.println("Channel open duration: " + (end - start) + "ms");
System.out.println("Channel open iterations: " + iteration);
if(!_session.isConnected()){
throw new JSchException("session is down");
}
if(this.getRecipient()==-1){ // timeout
throw new JSchException("channel is not opened.");
}
if(this.open_confirmation==false){ // SSH_MSG_CHANNEL_OPEN_FAILURE
throw new JSchException("channel unable to be opened, SSH_MSG_CHANNEL_OPEN_FAILURE received.");
}
connected=true;
}
I did move the definition of t up just to avoid getting spammed the value. The results on the attempts that were reporting the "channel is not opened" message showed that the wait(t) statement above doesn't appear to be respecting the timeout. An example of the output (server name x'd out):
JCraft Connected sucessfully to server : xxx.xxxx.xxx
Timeout: 0
t: 10
Channel open duration: 8ms
Channel open iterations: 2000
ERROR: channel is not opened.
com.jcraft.jsch.JSchException: channel is not opened.
at com.jcraft.jsch.Channel.sendChannelOpen(Channel.java:783)
at com.jcraft.jsch.Channel.connect(Channel.java:153)
at com.jcraft.jsch.Channel.connect(Channel.java:147)
at com.gorman.tools.FTPTestUtility.testConnectionJCraft(FTPTestUtility.java:46)
at com.gorman.tools.FTPTestUtility.main(FTPTestUtility.java:21)
So in this case, the request ran through all 2000 iterations in a total of 8ms. There are some times where the attempts are successful and the wait() appears to be respected:
JCraft Connected sucessfully to server : xxx.xxxx.xxx
Timeout: 0
t: 10
Notifying all for setRecipient
Channel open duration: 21ms
Channel open iterations: 3
Here we have 3 iterations, and when I log them individually I get a 10/11/0 split on the times which is what I would expect. I did add statements to the areas calling notifyAll() as well to make sure inadvertent calls weren't causing the wait to continue, but no calls were made except when the recipient ID was being set - as expected.
Oddly, if I pass a timeout value >= 1200ms to the call, I can't get this to occur. I am wondering if this is just a byproduct of the quirkiness with wait(), specifically this note in the JavaDoc regarding the timeout value:
As I said, we've had this running for years with the default call to channel.connect() in place, but I am wondering if some new latency has brought this issue to light for us. Even overkill on the println calls slowed things down enough to connect consistently, looks like 21ms is what I should expect right now, but prior to last Friday everything worked and we haven't seen this issue before as far as I know.
Any thoughts are appreciated, my workaround at this point appears to be channel.connect(reasonableValue). I just wanted to get a further understanding of the case and put this out here in case it helped someone else keep from pulling their hair out!
Upvotes: 5
Views: 2135
Reputation: 539
Solution Found This appears to have been a casualty of the Leap Second and an apparent need for a reboot of our CentOS box (uptime 45 days). RHEL boxes were working fine for other clients, but this CentOS box was exhibiting the behavior described above.
We had validated that the appropriate kernel versions were in place prior to 12/31 and ntpdate and tzdata versions were higher than suggested to account for the Leap Second. The latest reboot had been 11/18 for Dirty Cow patching.
However, numerous patches had been auto-applied (patching level = BASIC) on 12/18 including a tzdata update (tzdata-java-2016j-1.el7.noarch). We took a chance and went the MS route and rebooted the box today. After the reboot the wait() calls behave as expected and our issue has cleared.
Upvotes: 6