nmurthy

Reputation: 1567

Cluster hangs in 'ssh-ready' state using Spark 1.2.0 EC2 launch script

I'm trying to launch a standalone Spark cluster using its pre-packaged EC2 scripts, but the launch hangs indefinitely waiting for the instances to reach the 'ssh-ready' state:

ubuntu@machine:~/spark-1.2.0-bin-hadoop2.4$ ./ec2/spark-ec2 -k <key-pair> -i <identity-file>.pem -r us-west-2 -s 3 launch test
Setting up security groups...
Searching for existing cluster test...
Spark AMI: ami-ae6e0d9e
Launching instances...
Launched 3 slaves in us-west-2c, regid = r-b_______6
Launched master in us-west-2c, regid = r-0______0
Waiting for all instances in cluster to enter 'ssh-ready' state..........

Yet I can SSH into these instances without complaint:

ubuntu@machine:~$ ssh -i <identity-file>.pem root@master-ip
Last login: Day MMM DD HH:mm:ss 20YY from c-AA-BBB-CCCC-DDD.eee1.ff.provider.net

       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|

https://aws.amazon.com/amazon-linux-ami/2013.03-release-notes/
There are 59 security update(s) out of 257 total update(s) available
Run "sudo yum update" to apply all updates.
Amazon Linux version 2014.09 is available.
[root@ip-internal ~]$

I'm trying to figure out whether this is a problem on the AWS side or with the Spark scripts. I never had this issue until recently.

Upvotes: 5

Views: 3461

Answers (4)

Greg Dubicki

Reputation: 6940

Spark 1.3.0+

This issue is fixed in Spark 1.3.0.


Spark 1.2.0

Your problem is caused by SSH failing silently because of conflicting entries in your SSH known_hosts file.

To resolve the issue, add -o UserKnownHostsFile=/dev/null to the SSH options used by your spark_ec2.py script, as sketched below.
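For reference, a minimal sketch of what that change could look like, assuming your copy of spark_ec2.py assembles its SSH options in a helper along the lines of ssh_args() (the exact function name and layout may differ in your version of the script):

def ssh_args(opts):
    # Base options the launch script passes to every ssh/scp invocation.
    parts = ['-o', 'StrictHostKeyChecking=no',
             # Added: never read or write ~/.ssh/known_hosts, so stale or
             # conflicting entries cannot make the ssh probe fail silently.
             '-o', 'UserKnownHostsFile=/dev/null']
    if opts.identity_file is not None:
        parts += ['-i', opts.identity_file]
    return parts

The key point is simply that the extra -o option ends up in every ssh command the script runs.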


Optionally, to clean up and avoid problems connecting to your cluster over SSH later on, I recommend that you:

  1. Remove all the lines from ~/.ssh/known_hosts that include EC2 hosts, for example:

ec2-54-154-27-180.eu-west-1.compute.amazonaws.com,54.154.27.180 ssh-rsa (...)

  2. Stop checking and storing host-key fingerprints for the temporary IPs of your EC2 instances altogether, for example with an ~/.ssh/config entry like the one sketched below this list
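As a sketch of that last point (this is a standard SSH client technique, not necessarily the exact solution the original answer linked to), you can add an entry like this to ~/.ssh/config so that host-key checking is skipped only for EC2 hostnames:

Host *.compute.amazonaws.com
    UserKnownHostsFile /dev/null
    StrictHostKeyChecking no

Adjust the Host pattern to match the regions you use; keeping it narrow means host-key verification stays enabled for everything else.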

Upvotes: 4

nmurthy

Reputation: 1567

I used the absolute (not relative) path to my identity file (inspired by Peter Zybrick) and did everything Grzegorz Dubicki suggested. Thank you.

Upvotes: 1

spar128

Reputation: 21

I had the same problem and followed all the steps mentioned in this thread (mainly adding -o UserKnownHostsFile=/dev/null to the spark_ec2.py script), but it still hung at:

Waiting for all instances in cluster to enter 'ssh-ready' state

Short answer:

Change the permissions on the private key file and rerun the spark-ec2 script:

[spar@673d356d]/tmp/spark-1.2.1-bin-hadoop2.4/ec2% chmod 0400 /tmp/mykey.pem

Long Answer:

To troubleshoot, I modified spark_ec2.py to log the ssh command it runs and then executed that command myself at the prompt; the problem turned out to be bad permissions on the key:

[spar@673d356d]/tmp/spark-1.2.1-bin-hadoop2.4/ec2% ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i /tmp/mykey.pem -o ConnectTimeout=3 [email protected]
Warning: Permanently added '52.1.208.72' (RSA) to the list of known hosts.
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@         WARNING: UNPROTECTED PRIVATE KEY FILE!          @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Permissions 0644 for '/tmp/mykey.pem' are too open.
It is required that your private key files are NOT accessible by others.
This private key will be ignored.
bad permissions: ignore key: /tmp/mykey.pem
Permission denied (publickey).

Upvotes: 2

Pete Zybrick

Reputation: 19

I just ran into the exact same situation. I went into the Python script at def is_ssh_available() and had it dump out the return code and cmd.

except subprocess.CalledProcessError, e:
    print "CalledProcessError"
    print e.returncode
    print e.cmd
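For context, here is roughly where those prints sit. This is a simplified sketch of the SSH probe in spark_ec2.py, kept in the script's Python 2 style; the real function and its helper for building the ssh invocation differ in detail, and ssh_command() below is only a stand-in:

import os
import subprocess

def ssh_command(opts):
    # Simplified stand-in for the script's own helper that builds the base
    # ssh invocation (the real script adds more options here).
    return ['ssh', '-o', 'StrictHostKeyChecking=no', '-i', opts.identity_file]

def is_ssh_available(host, opts):
    # The launcher polls every instance with a trivial command over SSH and
    # only reports 'ssh-ready' once all of them succeed.
    try:
        with open(os.devnull, 'w') as devnull:
            subprocess.check_call(
                ssh_command(opts) + ['-o', 'ConnectTimeout=3',
                                     '%s@%s' % (opts.user, host), 'true'],
                stdout=devnull, stderr=devnull)
        return True
    except subprocess.CalledProcessError, e:
        # Debugging additions from above: show why the probe failed.
        print "CalledProcessError"
        print e.returncode
        print e.cmd
        return False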

I had the key file location as ~/.pzkeys/mykey.pem - as an experiment, I changed it to fully qualified - i.e. /home/pete.zybrick/.pzkeys/mykey.pem and that worked ok.

Right after that I ran into another error: I tried --user=ec2-user (I try to avoid using root) and got a permission error on rsync, so I removed --user=ec2-user to fall back to the default of root, made another attempt with --resume, and it ran to successful completion.

Upvotes: 1
