Background
I have been battling with Apache Spark and have worked out every error except one. I have a master and one slave. I can start the master with
./sbin/start-master.sh
and then I can connect to it from the slave by
JAVA_OPTS="-Xmx10g" ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://10.17.16.43:7077
I then see the success message
14/08/25 08:47:04 INFO worker.Worker: Successfully registered with master spark://10.17.16.43:7077
Everything described here is repeatable (I have been at this for a while). I can telnet to the master from the slave without problems, which is the check most tutorials suggest. SSH between master and slave uses RSA keys, so no passwords are needed, as recommended elsewhere.
My spark/conf/spark-env.sh contains the following (there are more lines, but they are commented out):
export SPARK_DAEMON_JAVA_OPTS+=" -Dspark.local.dir=/mnt/spark,/mnt2/spark -Dspark.akka.logLifecycleEvents=true"
export SPARK_LOCAL_IP=`ifconfig | sed -En 's/127.0.0.1//;s/.*inet (addr:)?(([0-9]*\.){3}[0-9]*).*/\2/p' | head -1`
export SPARK_MASTER_IP=$SPARK_LOCAL_IP
export SPARK_MASTER_WEBUI_PORT=8090
export SPARK_WORKER_CORES=1
I pulled those from various tutorials in the hope that they would fix something.
Here is /etc/hosts on the master:
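For readers unfamiliar with the SPARK_LOCAL_IP line above, here is a small sketch of what that pipeline does. It runs the same sed expression against made-up ifconfig output instead of the live command; the function name first_non_loopback_ip and the sample text (including the MAC address) are illustrative assumptions, not part of my setup.

```shell
# Same sed expression as in spark-env.sh, wrapped in a hypothetical helper
# so it can be run against sample text instead of live ifconfig output.
first_non_loopback_ip() {
  # 1st substitution: blank out the loopback address so it cannot match.
  # 2nd substitution: print only the IPv4 address that follows "inet ".
  sed -En 's/127.0.0.1//;s/.*inet (addr:)?(([0-9]*\.){3}[0-9]*).*/\2/p' | head -1
}

# Made-up ifconfig output for illustration only.
sample_ifconfig='lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
eth0      Link encap:Ethernet  HWaddr 00:11:22:33:44:55
          inet addr:10.17.16.43  Bcast:10.17.16.255  Mask:255.255.255.0'

printf '%s\n' "$sample_ifconfig" | first_non_loopback_ip   # prints 10.17.16.43
```

Note that the pipeline simply takes the first non-loopback address it finds, so on a machine with several interfaces it may not pick the address on the cluster network.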
127.0.0.1 localhost
10.17.16.43 aidan-workstation
10.17.16.49 ubuntu
And on the slave:
127.0.0.1 localhost
10.17.16.49 ubuntu
10.17.16.43 aidan-workstation
The Error
When I run ./bin/spark-shell
I get the following in the master terminal (I have posted only the tail end; the full output is here):
14/08/25 08:58:25 INFO client.AppClient$ClientActor: Executor added: app-20140825085822-0002/8 on worker-20140825084704-ubuntu-49237 (ubuntu:49237) with 8 cores
14/08/25 08:58:25 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20140825085822-0002/8 on hostPort ubuntu:49237 with 8 cores, 512.0 MB RAM
14/08/25 08:58:25 INFO client.AppClient$ClientActor: Executor updated: app-20140825085822-0002/8 is now RUNNING
14/08/25 08:58:25 INFO client.AppClient$ClientActor: Executor updated: app-20140825085822-0002/8 is now FAILED (Command exited with code 1)
14/08/25 08:58:25 INFO cluster.SparkDeploySchedulerBackend: Executor app-20140825085822-0002/8 removed: Command exited with code 1
14/08/25 08:58:25 INFO client.AppClient$ClientActor: Executor added: app-20140825085822-0002/9 on worker-20140825084704-ubuntu-49237 (ubuntu:49237) with 8 cores
14/08/25 08:58:25 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20140825085822-0002/9 on hostPort ubuntu:49237 with 8 cores, 512.0 MB RAM
14/08/25 08:58:25 INFO client.AppClient$ClientActor: Executor updated: app-20140825085822-0002/9 is now RUNNING
14/08/25 08:58:25 INFO client.AppClient$ClientActor: Executor updated: app-20140825085822-0002/9 is now FAILED (Command exited with code 1)
14/08/25 08:58:25 INFO cluster.SparkDeploySchedulerBackend: Executor app-20140825085822-0002/9 removed: Command exited with code 1
14/08/25 08:58:25 ERROR client.AppClient$ClientActor: Master removed our application: FAILED; stopping client
14/08/25 08:58:25 WARN cluster.SparkDeploySchedulerBackend: Disconnected from Spark cluster! Waiting for reconnection...
At the same time, the slave outputs the following (again only the tail; the full output is here as well):
14/08/25 09:04:18 INFO worker.ExecutorRunner: Launch command: "/usr/lib/jvm/java-8-oracle/bin/java" "-cp" ":/home/hduser/spark/conf:/home/hduser/spark/assembly/target/scala-2.10/spark-assembly_2.10-0.9.2-hadoop2.2.0.jar:/home/hduser/hadoop/etc/hadoop" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "akka.tcp://spark@aidan-workstation:60456/user/CoarseGrainedScheduler" "7" "ubuntu" "8" "akka.tcp://sparkWorker@ubuntu:55553/user/Worker" "app-20140825090434-0003"
14/08/25 09:04:18 INFO worker.Worker: Executor app-20140825090434-0003/7 finished with state FAILED message Command exited with code 1 exitStatus 1
14/08/25 09:04:18 INFO worker.Worker: Asked to launch executor app-20140825090434-0003/8 for Spark shell
14/08/25 09:04:18 INFO worker.ExecutorRunner: Launch command: "/usr/lib/jvm/java-8-oracle/bin/java" "-cp" ":/home/hduser/spark/conf:/home/hduser/spark/assembly/target/scala-2.10/spark-assembly_2.10-0.9.2-hadoop2.2.0.jar:/home/hduser/hadoop/etc/hadoop" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "akka.tcp://spark@aidan-workstation:60456/user/CoarseGrainedScheduler" "8" "ubuntu" "8" "akka.tcp://sparkWorker@ubuntu:55553/user/Worker" "app-20140825090434-0003"
14/08/25 09:04:19 INFO worker.Worker: Executor app-20140825090434-0003/8 finished with state FAILED message Command exited with code 1 exitStatus 1
14/08/25 09:04:19 INFO worker.Worker: Asked to launch executor app-20140825090434-0003/9 for Spark shell
14/08/25 09:04:19 INFO worker.ExecutorRunner: Launch command: "/usr/lib/jvm/java-8-oracle/bin/java" "-cp" ":/home/hduser/spark/conf:/home/hduser/spark/assembly/target/scala-2.10/spark-assembly_2.10-0.9.2-hadoop2.2.0.jar:/home/hduser/hadoop/etc/hadoop" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "akka.tcp://spark@aidan-workstation:60456/user/CoarseGrainedScheduler" "9" "ubuntu" "8" "akka.tcp://sparkWorker@ubuntu:55553/user/Worker" "app-20140825090434-0003"
14/08/25 09:04:19 INFO worker.Worker: Executor app-20140825090434-0003/9 finished with state FAILED message Command exited with code 1 exitStatus 1
You may notice that the timestamps in the two logs do not line up. That is my doing: I had to rerun the programs at different times to get clean output. It is not caused by Spark.
What I want
How can I connect my master and slave such that I can run Scala programs on a distributed system?
Upvotes: 12
Views: 2613
Reputation: 15879
I note from your logs that Akka is using a simple hostname (aidan-workstation) rather than a fully qualified domain name such as aidan-workstation.acme.com:
akka.tcp://spark@aidan-workstation:60456/user/CoarseGrainedScheduler
akka.tcp://sparkWorker@ubuntu:55553/user/Worker
According to this user post, that may be the issue you're having:
I had to set SPARK_MASTER_IP in conf/start-master.sh to hostname -f instead of hostname, since akka seems not to work properly with host names / ip, it requires fully qualified domain names.
You can try editing your hosts file to include a faked domain name.
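As a rough sketch of what that hosts-file edit might look like, the snippet below prints entries that list a fully qualified name first and keep the short name as an alias. The helper hosts_entry is hypothetical, and acme.com is only the stand-in domain from the example above, not a real domain for your machines. I have not tested this against your cluster.

```shell
# Hypothetical helper: builds one /etc/hosts line with the (faked) fully
# qualified name first and the short hostname as an alias.
hosts_entry() {
  ip="$1"; short="$2"; domain="$3"
  printf '%s %s.%s %s\n' "$ip" "$short" "$domain" "$short"
}

# IPs and short names are from the question; acme.com is a placeholder.
hosts_entry 10.17.16.43 aidan-workstation acme.com
hosts_entry 10.17.16.49 ubuntu acme.com

# Following the quoted post, spark-env.sh could then use the full name
# (left commented out here, since it depends on the machine it runs on):
# export SPARK_MASTER_IP=$(hostname -f)
```

The printed lines would replace the existing 10.17.16.43 and 10.17.16.49 entries in /etc/hosts on both machines. Afterwards, running hostname -f on each machine should report the fully qualified name if the change has taken effect.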
Upvotes: 1