eis

Reputation: 53462

Jenkins doesn't recognize that a slave is down and thus does not allow it to reconnect

We have a Jenkins instance running on Ubuntu with several slaves on different systems. One of them is a Windows 7 host with the Jenkins slave configured as a service.

The problem is that when that machine is rebooted, the Jenkins master doesn't realize it's gone: the slave still looks fine in the nodes view. Then, when a build that is supposed to use that slave is started, it gets stuck. If that build is stopped, the next build fails immediately:

Caused by: java.util.concurrent.TimeoutException: Ping started at 1457016721684 hasn't completed by 1457016961684
    ... 2 more
[EnvInject] - [ERROR] - SEVERE ERROR occurs: channel is already closed
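
The two numbers in the TimeoutException are milliseconds since the Unix epoch. A minimal sketch for decoding them follows (the variable and function names are illustrative and do not come from Jenkins). It shows that the ping had been outstanding for 240 seconds when it was given up on.

```python
from datetime import datetime, timezone

# Timestamps copied from the exception message (milliseconds since the Unix epoch).
ping_started_ms = 1457016721684
ping_deadline_ms = 1457016961684


def to_utc(ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


# Difference between the two timestamps, in seconds.
timeout_seconds = (ping_deadline_ms - ping_started_ms) / 1000

print("ping started: ", to_utc(ping_started_ms).isoformat())
print("ping deadline:", to_utc(ping_deadline_ms).isoformat())
print("elapsed:", timeout_seconds, "seconds")
```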

When the slave has started up and tries to connect back to the master, the connection is refused, and the logs show an error saying that a connection with that name already exists:

Server didn't accept the handshake: xxx is already connected to this master. Rejecting this connection.

There is an issue, JENKINS-5055, which claims a fix was committed that allows the same JNLP slave to reconnect without getting rejected (apparently this commit). According to the changelog, it was introduced in version 1.396 (2011/02/02). However, we are using version 1.639 and still seeing this. Somebody else seems to be seeing it as well. Looking at the current codebase, I can see where the error comes from, but I don't see the fix made for JENKINS-5055.

Any ideas on resolving this?

Edit: also asked on the Jenkins user mailing list, but got no responses.

Upvotes: 3

Views: 3995

Answers (2)

eis

Reputation: 53462

Reinstalling the slave on a Windows Server 2012 R2 machine shows no signs of this behavior, so it seems that either a mistake was made during the installation steps or this is caused by using a workstation version of Windows.

Regardless, here are the steps that got it working, assuming a brand new installation of Windows, with no network connectivity, and a master instance using a self-signed certificate:

  1. Install a JRE on the machine. If you have a 64-bit operating system, install both the 32-bit and 64-bit versions; otherwise go with 32-bit. Download link here.
  2. Install .NET 3.5 on the machine. This is needed by the Jenkins service. You can follow the steps outlined in my other answer for this.
  3. Install Jenkins using the Windows installer (zipped) to C:\Jenkins. It can be downloaded from here.
  4. Check that your installation is responding by navigating to http://localhost:8080. In case of trouble, check the logs in the Jenkins folder. If there is a port conflict, edit jenkins.xml and change httpPort to something else.
  5. From the Windows computer, navigate to your master Jenkins and configure a new node there.
  6. Start a slave agent using Java Launch Agent in the configure -> node screen (you still need to be on your Windows slave computer).
  7. You should see a window open. From there, select File -> Install as a service (details and screenshots). If you get an error without a proper explanation, confirm that .NET 3.5 is properly installed. If you see "WMI.WmiException: AccessDenied", save the .jnlp file locally and start it from an administrator prompt or otherwise with elevated privileges (details).
  8. From Administrative Tools -> Services, stop and disable the Jenkins service, and stop the Jenkins Slave Agent service but leave it set to Automatic so that it starts when the computer boots.
  9. This is only relevant if you're using a self-signed or otherwise problematic certificate:
    1. Download the previously mentioned Java Launch Agent file (.jnlp file) again and save it to C:\jenkins
    2. Open C:\jenkins\jenkins-slave.xml in your editor
    3. Change it to refer to your local .jnlp file by changing the jnlp url parameter (file:/C:/jenkins/jenkins-slave.jnlp)
    4. Add -noCertificateCheck to the parameters
    5. Replace the -secret parameter with -auth "user:pass", since otherwise URL GET parameters are added automatically, which breaks locating the .jnlp file
  10. Start the Jenkins Slave Agent service again
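
The three edits in step 9 can be sketched as a small string rewrite. This is only an illustration. The starting argument string, the host name jenkins.example.com, the node name and the secret are hypothetical placeholders. The assumption that the URL is passed via a -jnlpUrl option, and that -noCertificateCheck can simply be appended at the end, should be checked against your own jenkins-slave.xml.

```python
import re


def rewrite_slave_arguments(arguments, local_jnlp_url, user, password):
    """Apply the edits from step 9 to the argument string of jenkins-slave.xml."""
    # 9.3: point the jnlp url parameter at the locally saved .jnlp file.
    arguments = re.sub(
        r"(-jnlpUrl\s+)\S+",
        lambda m: m.group(1) + local_jnlp_url,
        arguments,
    )
    # 9.5: replace "-secret <value>" with -auth "user:pass".
    auth = '-auth "{}:{}"'.format(user, password)
    arguments = re.sub(r"-secret\s+\S+", lambda m: auth, arguments)
    # 9.4: add -noCertificateCheck unless it is already present.
    if "-noCertificateCheck" not in arguments:
        arguments += " -noCertificateCheck"
    return arguments


# Hypothetical original argument string; yours will differ.
original = (
    '-Xrs -jar "%BASE%\\slave.jar" '
    "-jnlpUrl https://jenkins.example.com/computer/win-slave/slave-agent.jnlp "
    "-secret 0123abcd"
)

rewritten = rewrite_slave_arguments(
    original, "file:/C:/jenkins/jenkins-slave.jnlp", "user", "pass"
)
print(rewritten)
```

Making the edit by hand in an editor, as described in the steps, is just as good. The sketch is mainly useful if the same change has to be repeated on several slaves.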

For problems with the Jenkins slave service, check jenkins-slave.err.log. On Windows Server 2012 R2, you can get tail-like functionality by running Get-Content .\jenkins-slave.err.log -Wait -Tail 10 in a PowerShell prompt. For older versions of PowerShell, leave out -Tail 10.

Upvotes: 0

user1700494

Reputation: 211

We faced the same issue and used https://wiki.jenkins-ci.org/display/JENKINS/slave-status as a workaround.

Upvotes: 1
