Reputation: 2813
I've been setting up a Spark standalone cluster setup following this link. I have 2 machines; The first one (ubuntu0) serve as both the master and a worker, and the second one (ubuntu1) is just a worker. Password-less ssh has been properly configured for both machines already and was tested by doing SSH manually on both sides.
Now when I tried to ./start-all.ssh, both master and worker on the master machine (ubuntu0) were started properly. This is signified by (1)WebUI being accessible (localhost:8081 on my part) and (2) Worker registered/displayed on the WebUI. However, the other worker on the second machine (ubuntu1), was not started. The error displayed was:
ubuntu1: ssh: connect to host ubuntu1 port 22: Connection timed out
Now this is quite weird already given that I've properly configured the ssh to be password-less on both sides. Given this, I accessed the second machine and tried to start the worker manually using these commands:
./spark-class org.apache.spark.deploy.worker.Worker spark://ubuntu0:7707
./spark-class org.apache.spark.deploy.worker.Worker spark://<ip>:7707
However, below is the result:
14/05/23 13:49:08 INFO Utils: Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties
14/05/23 13:49:08 WARN Utils: Your hostname, ubuntu1 resolves to a loopback address:
127.0.1.1; using 192.168.122.1 instead (on interface virbr0)
14/05/23 13:49:08 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
14/05/23 13:49:09 INFO Slf4jLogger: Slf4jLogger started
14/05/23 13:49:09 INFO Remoting: Starting remoting
14/05/23 13:49:09 INFO Remoting: Remoting started; listening on addresses :
[akka.tcp://[email protected]:42739]
14/05/23 13:49:09 INFO Worker: Starting Spark worker ubuntu1.local:42739 with 8 cores,
4.8 GB RAM
14/05/23 13:49:09 INFO Worker: Spark home: /home/ubuntu1/jaysonp/spark/spark-0.9.1
14/05/23 13:49:09 INFO WorkerWebUI: Started Worker web UI at http://ubuntu1.local:8081
14/05/23 13:49:09 INFO Worker: Connecting to master spark://ubuntu0:7707...
14/05/23 13:49:29 INFO Worker: Connecting to master spark://ubuntu0:7707...
14/05/23 13:49:49 INFO Worker: Connecting to master spark://ubuntu0:7707...
14/05/23 13:50:09 ERROR Worker: All masters are unresponsive! Giving up.
Below are the contents of my master and slave\worker spark-env.ssh:
SPARK_MASTER_IP=192.168.3.222
STANDALONE_SPARK_MASTER_HOST=`hostname -f`
How should I resolve this? Thanks in advance!
Upvotes: 3
Views: 10534
Reputation: 7085
I have add similar issues today running spark 1.5.1 on RHEL 6.7. I have 2 machines, their hostname being - master.domain.com - slave.domain.com
I installed a standalone version of spark (pre-build against hadoop 2.6) and installed my Oracle jdk-8u66.
Spark download:
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.5.1-bin-hadoop2.6.tgz
Java download
wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/8u66-b17/jdk-8u66-linux-x64.tar.gz"
after spark and java are unpacked in my home directory I did the following:
on 'master.domain.com' I ran:
./sbin/start-master.sh
The webUI become available at http://master.domain.com:8080 (no slave running)
on 'slave.domain.com' I did try:
./sbin/start-slave.sh spark://master.domain.com:7077
FAILED AS FOLLOW
Spark Command: /root/java/bin/java -cp /root/spark-1.5.1-bin-hadoop2.6/sbin/../conf/:/root/spark-1.5.1-bin-hadoop2.6/lib/spark-assembly-1.5.1-hadoop2.6.0.jar:/root/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/root/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/root/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar -Xms1g -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://master.domain.com:7077
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/11/06 11:03:51 INFO Worker: Registered signal handlers for [TERM, HUP, INT]
15/11/06 11:03:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/11/06 11:03:51 INFO SecurityManager: Changing view acls to: root
15/11/06 11:03:51 INFO SecurityManager: Changing modify acls to: root
15/11/06 11:03:51 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/11/06 11:03:52 INFO Slf4jLogger: Slf4jLogger started
15/11/06 11:03:52 INFO Remoting: Starting remoting
15/11/06 11:03:52 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:50573]
15/11/06 11:03:52 INFO Utils: Successfully started service 'sparkWorker' on port 50573.
15/11/06 11:03:52 INFO Worker: Starting Spark worker 10.80.70.38:50573 with 8 cores, 6.7 GB RAM
15/11/06 11:03:52 INFO Worker: Running Spark version 1.5.1
15/11/06 11:03:52 INFO Worker: Spark home: /root/spark-1.5.1-bin-hadoop2.6
15/11/06 11:03:53 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
15/11/06 11:03:53 INFO WorkerWebUI: Started WorkerWebUI at http://10.80.70.38:8081
15/11/06 11:03:53 INFO Worker: Connecting to master master.domain.com:7077...
15/11/06 11:04:05 INFO Worker: Retrying connection to master (attempt # 1)
15/11/06 11:04:05 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[sparkWorker-akka.actor.default-dispatcher-4,5,main]
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@48711bf5 rejected from java.util.concurrent.ThreadPoolExecutor@14db705b[Running, pool size = 1, active threads = 0, queued tasks = 0, completed tasks = 1]
at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1.apply(Worker.scala:211)
at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1.apply(Worker.scala:210)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.apache.spark.deploy.worker.Worker.org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters(Worker.scala:210)
at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$reregisterWithMaster$1.apply$mcV$sp(Worker.scala:288)
at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1119)
at org.apache.spark.deploy.worker.Worker.org$apache$spark$deploy$worker$Worker$$reregisterWithMaster(Worker.scala:234)
at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:521)
at org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$processMessage(AkkaRpcEnv.scala:177)
at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$4.apply$mcV$sp(AkkaRpcEnv.scala:126)
at org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$safelyCall(AkkaRpcEnv.scala:197)
at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1.applyOrElse(AkkaRpcEnv.scala:125)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1.aroundReceive(AkkaRpcEnv.scala:92)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
15/11/06 11:04:05 INFO ShutdownHookManager: Shutdown hook called
start-slave spark://<master-IP>:7077
also FAILED as above.
start-slave spark://master:7077
WORKED and the worker shows in the master webUI
Spark Command: /root/java/bin/java -cp /root/spark-1.5.1-bin-hadoop2.6/sbin/../conf/:/root/spark-1.5.1-bin-hadoop2.6/lib/spark-assembly-1.5.1-hadoop2.6.0.jar:/root/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/root/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/root/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar -Xms1g -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://master:7077
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/11/06 11:08:15 INFO Worker: Registered signal handlers for [TERM, HUP, INT]
15/11/06 11:08:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/11/06 11:08:15 INFO SecurityManager: Changing view acls to: root
15/11/06 11:08:15 INFO SecurityManager: Changing modify acls to: root
15/11/06 11:08:15 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/11/06 11:08:16 INFO Slf4jLogger: Slf4jLogger started
15/11/06 11:08:16 INFO Remoting: Starting remoting
15/11/06 11:08:17 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:40780]
15/11/06 11:08:17 INFO Utils: Successfully started service 'sparkWorker' on port 40780.
15/11/06 11:08:17 INFO Worker: Starting Spark worker 10.80.70.38:40780 with 8 cores, 6.7 GB RAM
15/11/06 11:08:17 INFO Worker: Running Spark version 1.5.1
15/11/06 11:08:17 INFO Worker: Spark home: /root/spark-1.5.1-bin-hadoop2.6
15/11/06 11:08:17 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
15/11/06 11:08:17 INFO WorkerWebUI: Started WorkerWebUI at http://10.80.70.38:8081
15/11/06 11:08:17 INFO Worker: Connecting to master master:7077...
15/11/06 11:08:17 INFO Worker: Successfully registered with master spark://master:7077
Note: I haven't added any extra config in conf/spark-env.sh
Note2: when looking in the master webUI, the spark master URL at the top is actually the one that worked for me, so I'd say in doubts just use that one.
I hope this helps ;)
Upvotes: 2
Reputation: 33
I guess you missed something in your configuration, that's what I learned from your log.
/etc/hosts
, make sure ubuntu1
in your master's host list and its Ip is match the slave's IP address.export SPARK_LOCAL_IP='ubuntu1'
in the spark-env.sh
file of your slaveUpvotes: 0
Reputation: 577
Using hostname in /cong/slaves worked well for me. Here are some steps I would take it,
My part of setting in spark-env.sh
export STANDALONE_SPARK_MASTER_HOST=
hostname
export SPARK_MASTER_IP=$STANDALONE_SPARK_MASTER_HOST
Upvotes: 0
Reputation: 2813
For those who are still encountering error(s) when it comes to starting workers on different machines, I just want to share that using IP addresses in conf/slaves worked for me. Hope this helps!
Upvotes: 3