Reputation: 1433
In order to run AMPLab's training exercises, I've created a keypair in us-east-1, installed the training scripts (git clone git://github.com/amplab/training-scripts.git -b ampcamp4), and set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY following the instructions in http://ampcamp.berkeley.edu/big-data-mini-course/launching-a-bdas-cluster-on-ec2.html
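For completeness, the credentials were exported along these lines (the key values here are placeholders, not real keys):

export AWS_ACCESS_KEY_ID=AKIAXXXXXXXXXXXXXXXX
export AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx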
Now running
./spark-ec2 -i ~/.ssh/myspark.pem -r us-east-1 -k myspark --copy launch try1
generates the following messages:
johndoe@ip-some-instance:~/projects/spark/training-scripts$ ./spark-ec2 -i ~/.ssh/myspark.pem -r us-east-1 -k myspark --copy launch try1
Setting up security groups...
Searching for existing cluster try1...
Latest Spark AMI: ami-19474270
Launching instances...
Launched 5 slaves in us-east-1b, regid = r-0c5e5ee3
Launched master in us-east-1b, regid = r-316060de
Waiting for instances to start up...
Waiting 120 more seconds...
Copying SSH key /home/johndoe/.ssh/myspark.pem to master...
ssh: connect to host ec2-54-90-57-174.compute-1.amazonaws.com port 22: Connection refused
Error connecting to host Command 'ssh -t -o StrictHostKeyChecking=no -i /home/johndoe/.ssh/myspark.pem root@ec2-54-90-57-174.compute-1.amazonaws.com 'mkdir -p ~/.ssh'' returned non-zero exit status 255, sleeping 30
ssh: connect to host ec2-54-90-57-174.compute-1.amazonaws.com port 22: Connection refused
Error connecting to host Command 'ssh -t -o StrictHostKeyChecking=no -i /home/johndoe/.ssh/myspark.pem root@ec2-54-90-57-174.compute-1.amazonaws.com 'mkdir -p ~/.ssh'' returned non-zero exit status 255, sleeping 30
...
...
subprocess.CalledProcessError: Command 'ssh -t -o StrictHostKeyChecking=no -i /home/johndoe/.ssh/myspark.pem root@ec2-54-90-57-174.compute-1.amazonaws.com '/root/spark/bin/stop-all.sh'' returned non-zero exit status 127
where root@ec2-54-90-57-174.compute-1.amazonaws.com is the user and master instance. I've tried -u ec2-user and increasing -w all the way up to 600, but I get the same error.
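For example, one of the variants I tried looked roughly like this (I may be misremembering the exact flag order):

./spark-ec2 -i ~/.ssh/myspark.pem -r us-east-1 -k myspark -u ec2-user -w 600 --copy launch try1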
I can see the master and slave instances in us-east-1
when I log into the AWS console, and I can actually ssh into the Master instance from the 'local' ip-some-instance
shell.
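For instance, from the ip-some-instance shell, something along these lines connects fine (same key and hostname as in the log above):

ssh -i ~/.ssh/myspark.pem root@ec2-54-90-57-174.compute-1.amazonaws.com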
My understanding is that the spark-ec2 script takes care of defining the master/slave security groups (which ports are listened on and so on), and I shouldn't have to tweak these settings. That said, the master and slaves all listen on port 22 (Port: 22, Protocol: tcp, Source: 0.0.0.0/0 in the ampcamp3-slaves/masters security groups).
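To double-check those rules from the command line, something like the following with the AWS CLI lists the inbound rules (group names as they appear in my console; adjust if yours differ):

aws ec2 describe-security-groups --region us-east-1 --group-names ampcamp3-masters ampcamp3-slaves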
I'm at a loss here, and would appreciate any pointers before I spend all my R&D funds on EC2 instances.... Thanks.
Upvotes: 6
Views: 1844
Reputation: 13801
This is most likely caused by SSH taking a long time to start up on the instances, causing the 120-second timeout to expire before the machines could be logged into. You should be able to run
./spark-ec2 -i ~/.ssh/myspark.pem -r us-east-1 -k myspark --copy launch --resume try1
(with the --resume
flag) to continue from where things left off without re-launching new instances. This issue will be fixed in Spark 1.2.0, where we have a new mechanism that intelligently checks the SSH status rather than relying on a fixed timeout. We're also addressing the root causes behind the long SSH startup delay by building new AMIs.
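For the curious, the new wait logic is conceptually along the lines of the sketch below (an illustration, not the actual spark-ec2 code): poll each host until SSH answers instead of sleeping for a fixed interval.

# Illustration only: wait for SSH to come up on one host rather than sleeping a fixed 120 s.
# HOST and KEY are placeholders; the real script loops over every node in the cluster.
HOST=ec2-54-90-57-174.compute-1.amazonaws.com
KEY=~/.ssh/myspark.pem
for attempt in $(seq 1 40); do
  if ssh -o StrictHostKeyChecking=no -o ConnectTimeout=5 -i "$KEY" root@"$HOST" true 2>/dev/null; then
    echo "SSH is up on $HOST"
    break
  fi
  echo "SSH not ready on $HOST (attempt $attempt of 40), retrying in 15 s..."
  sleep 15
done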
Upvotes: 7