Steve Lorimer

Reputation: 28659

non-deterministic connection success for recently started Google Compute Engine VM

I am using the googleapiclient Python API to start a VM, and then paramiko to connect to it via SSH.

I use googleapiclient.discovery to get the GCE API:

compute = googleapiclient.discovery.build('compute', 'v1')

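I start my VM using the start API call:

# googleapiclient methods only accept keyword arguments
req  = compute.instances().start(project=project, zone=zone, instance=instance)
resp = req.execute()

# Poll the returned zone operation rather than re-issuing the start request
while resp['status'] != 'DONE':
    time.sleep(1)
    resp = compute.zoneOperations().get(project=project, zone=zone, operation=resp['name']).execute()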

I then perform a get request to find the VM details and, in turn, the ephemeral external IP address:

req  = compute.instances().get(project=project, zone=zone, instance=instance)
info = req.execute()

ip_address = info['networkInterfaces'][0]['accessConfigs'][0]['natIP']

Finally, I use paramiko to connect to this ip address.

ssh_client = paramiko.SSHClient()
ssh_client.connect(ip_address)

Non-deterministically, the connect call fails:

.../lib/python3.6/site-packages/paramiko/client.py", line 362, in connect
raise NoValidConnectionsError(errors)

paramiko.ssh_exception.NoValidConnectionsError:
[Errno None] Unable to connect to port 22 on xxx.xxx.xxx.xxx

It seems to be timing related, as putting a time.sleep(5) before the ssh_client.connect call prevents this error.

I'm assuming this allows sufficient time for sshd to start accepting connections, but I'm not certain.

Putting sleeps in my code is uber hacky, so I'd much prefer to find a way to deterministically wait until the SSH daemon is running and available for me to connect to (if that is indeed the cause of the NoValidConnectionsError exception).

Alternatively, I see paramiko has a timeout option in the connect call. Should I just change my 5 second sleep to a 5 second timeout?

Upvotes: 2

Views: 77

Answers (1)

Dan

Reputation: 7737

There’s no way for GCE to know if the guest is SSH-able. (For instance, imagine a case where the guest uses a nonstandard method for allowing remote connections, so even checking sshd wouldn’t work. Even if you could rely on sshd, the way to check that it’s running depends on its version, host OS, configuration, etc.) GCE only knows hardware-level information about the VM, such as whether it rebooted.
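Since GCE cannot report this, the check has to come from the client side. A minimal sketch, assuming sshd listens on the default port 22 and that a successful TCP connect is a good enough signal: poll the port with the standard library until it accepts a connection or a deadline passes. The function name wait_for_port and all of its parameters are made up for illustration. An open port only shows that something is listening; it does not prove that sshd is ready to authenticate.

```python
import socket
import time


def wait_for_port(host, port=22, deadline_s=60.0, interval_s=1.0, attempt_timeout_s=3.0):
    """Return True once a TCP connect to (host, port) succeeds, False if deadline_s passes first."""
    give_up_at = time.monotonic() + deadline_s
    while True:
        try:
            # create_connection raises OSError on refusal, timeout, or an unreachable host.
            with socket.create_connection((host, port), timeout=attempt_timeout_s):
                return True
        except OSError:
            # Stop if sleeping again would take us past the overall deadline.
            if time.monotonic() + interval_s >= give_up_at:
                return False
            time.sleep(interval_s)
```

Calling wait_for_port(ip_address) before ssh_client.connect(ip_address) would replace the fixed sleep with a bounded wait that returns as soon as the port opens.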

To solve your problem, I would try the timeout mechanism in paramiko like you described, or maybe retry the connection attempt in a loop with a timeout since paramiko might not implement a full-state-reset retry internally (just speculating, I’m not sure).
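The retry idea could be sketched roughly as below. The helper connect_with_retry and its parameters are hypothetical, not part of paramiko. It relies on paramiko's NoValidConnectionsError being a subclass of socket.error (and therefore OSError), so catching OSError covers refused connections as well as timeouts. Note that a refused connection fails immediately, so paramiko's timeout argument alone would probably not wait for sshd to come up; the loop is what provides the waiting.

```python
import time


def connect_with_retry(connect, deadline_s=60.0, delay_s=2.0, retry_on=(OSError,)):
    """Call connect() until it succeeds; re-raise the last error once deadline_s is used up."""
    give_up_at = time.monotonic() + deadline_s
    while True:
        try:
            return connect()
        except retry_on:
            # Re-raise if another delay would overshoot the overall deadline.
            if time.monotonic() + delay_s >= give_up_at:
                raise
            time.sleep(delay_s)


# Hypothetical usage with paramiko (not run here). A fresh client is built on
# each attempt so no state from a failed connect is reused:
#
#   def open_ssh():
#       client = paramiko.SSHClient()
#       client.connect(ip_address, timeout=5)  # mirrors the question's connect call
#       return client
#
#   ssh_client = connect_with_retry(open_ssh, deadline_s=60.0)
```

Passing the connect step as a callable keeps the helper independent of paramiko. If sshd accepts the TCP connection before it is fully ready, paramiko may raise SSHException instead, and that could be added to retry_on if it turns out to be needed.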

Also, I think 5 seconds may be a little low. It’s probably fine for the average response time, but outliers will be slower, which could make your connection attempts flaky. Maybe bump that to 30 seconds or a minute to be safe.

Upvotes: 2
