Reputation: 1321
I am trying to do a very simple Spark setup over SSH tunneling and I can't make it work.
I have the master running on my PC, started with ./sbin/start-master.sh -h localhost -p 7077 (unless stated otherwise, everything else is default).
On my slave PC (IP 192.168.0.222), which is in another domain and on which I have no root access, I ran ssh -N -L localhost:7078:localhost:7077 myMasterPCSSHalias and started the slave with ./sbin/start-slave.sh spark://localhost:7078. I can now see this slave on the dashboard at http://localhost:8080/ in my browser, and it shows 14GB of free memory.
When I then try, for example, this:
./bin/spark-submit --master spark://localhost:7077 examples/src/main/python/pi.py 10
it hangs on this message until I kill it (the full log is below):
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
I am sure I am not requesting more resources than are available; the problem persists even with --executor-memory 512m, and the running executor just reports the RUNNING state. The only thing in the error log is this:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/05/09 22:45:44 INFO CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
16/05/09 22:45:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/05/09 22:45:45 INFO SecurityManager: Changing view acls to: hnykdan1,dan
16/05/09 22:45:45 INFO SecurityManager: Changing modify acls to: hnykdan1,dan
16/05/09 22:45:45 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hnykdan1, dan); users with modify permissions: Set(hnykdan1, dan)
and the slave log contains this:
16/05/09 22:48:56 INFO Worker: Asked to launch executor app-20160509224034-0013/0 for PythonPi
16/05/09 22:48:56 INFO SecurityManager: Changing view acls to: hnykdan1
16/05/09 22:48:56 INFO SecurityManager: Changing modify acls to: hnykdan1
16/05/09 22:48:56 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hnykdan1); users with modify permissions: Set(hnykdan1)
16/05/09 22:48:56 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java" "-cp" "/home/hnykdan1/spark/conf/:/home/hnykdan1/spark/lib/spark-assembly-1.6.1-hadoop2.6.0.jar:/home/hnykdan1/spark/lib/datanucleus-core-3.2.10.jar:/home/hnykdan1/spark/lib/datanucleus-api-jdo-3.2.6.jar:/home/hnykdan1/spark/lib/datanucleus-rdbms-3.2.9.jar" "-Xms1024M" "-Xmx1024M" "-Dspark.driver.port=37450" "-XX:MaxPermSize=256m" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@192.168.0.222:37450" "--executor-id" "0" "--hostname" "147.32.8.103" "--cores" "8" "--app-id" "app-20160509224034-0013" "--worker-url" "spark://Worker@147.32.8.103:54894"
Everything looks quite normal and I don't know where the problem might be. Do I need to tunnel the other way around as well? It runs fine when I run the slave locally in exactly the same fashion. Thanks.
Full log of the spark-submit run:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/05/09 22:28:21 INFO SparkContext: Running Spark version 1.6.1
16/05/09 22:28:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/05/09 22:28:22 INFO SecurityManager: Changing view acls to: dan
16/05/09 22:28:22 INFO SecurityManager: Changing modify acls to: dan
16/05/09 22:28:22 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(dan); users with modify permissions: Set(dan)
16/05/09 22:28:22 INFO Utils: Successfully started service 'sparkDriver' on port 34508.
16/05/09 22:28:23 INFO Slf4jLogger: Slf4jLogger started
16/05/09 22:28:23 INFO Remoting: Starting remoting
16/05/09 22:28:23 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.0.222:44359]
16/05/09 22:28:23 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 44359.
16/05/09 22:28:23 INFO SparkEnv: Registering MapOutputTracker
16/05/09 22:28:23 INFO SparkEnv: Registering BlockManagerMaster
16/05/09 22:28:23 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-db4c3293-423f-4966-a479-b69a90439da9
16/05/09 22:28:23 INFO MemoryStore: MemoryStore started with capacity 511.1 MB
16/05/09 22:28:23 INFO SparkEnv: Registering OutputCommitCoordinator
16/05/09 22:28:24 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/05/09 22:28:24 INFO SparkUI: Started SparkUI at http://192.168.0.222:4040
16/05/09 22:28:24 INFO HttpFileServer: HTTP File server directory is /tmp/spark-d532a9c1-0455-4937-ad27-b47abb2a65e8/httpd-aa031b8c-f605-41c3-aabe-fc4fe01bdcf8
16/05/09 22:28:24 INFO HttpServer: Starting HTTP Server
16/05/09 22:28:24 INFO Utils: Successfully started service 'HTTP file server' on port 41770.
16/05/09 22:28:24 INFO Utils: Copying /home/hnykdan1/spark/examples/src/main/python/pi.py to /tmp/spark-d532a9c1-0455-4937-ad27-b47abb2a65e8/userFiles-14720bed-cd41-4b15-9bd3-38dbf4f268ff/pi.py
16/05/09 22:28:24 INFO SparkContext: Added file file:/home/hnykdan1/spark/examples/src/main/python/pi.py at http://192.168.0.222:41770/files/pi.py with timestamp 1462825704629
16/05/09 22:28:24 INFO AppClient$ClientEndpoint: Connecting to master spark://localhost:7077...
16/05/09 22:28:24 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20160509222824-0011
16/05/09 22:28:24 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 44617.
16/05/09 22:28:24 INFO NettyBlockTransferService: Server created on 44617
16/05/09 22:28:24 INFO AppClient$ClientEndpoint: Executor added: app-20160509222824-0011/0 on worker-20160509214654-147.32.8.103-54894 (147.32.8.103:54894) with 8 cores
16/05/09 22:28:24 INFO BlockManagerMaster: Trying to register BlockManager
16/05/09 22:28:24 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160509222824-0011/0 on hostPort 147.32.8.103:54894 with 8 cores, 1024.0 MB RAM
16/05/09 22:28:24 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.0.222:44617 with 511.1 MB RAM, BlockManagerId(driver, 192.168.0.222, 44617)
16/05/09 22:28:24 INFO BlockManagerMaster: Registered BlockManager
16/05/09 22:28:25 INFO AppClient$ClientEndpoint: Executor updated: app-20160509222824-0011/0 is now RUNNING
16/05/09 22:28:25 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
16/05/09 22:28:25 INFO SparkContext: Starting job: reduce at /home/hnykdan1/spark/examples/src/main/python/pi.py:39
16/05/09 22:28:25 INFO DAGScheduler: Got job 0 (reduce at /home/hnykdan1/spark/examples/src/main/python/pi.py:39) with 10 output partitions
16/05/09 22:28:25 INFO DAGScheduler: Final stage: ResultStage 0 (reduce at /home/hnykdan1/spark/examples/src/main/python/pi.py:39)
16/05/09 22:28:25 INFO DAGScheduler: Parents of final stage: List()
16/05/09 22:28:25 INFO DAGScheduler: Missing parents: List()
16/05/09 22:28:25 INFO DAGScheduler: Submitting ResultStage 0 (PythonRDD[1] at reduce at /home/hnykdan1/spark/examples/src/main/python/pi.py:39), which has no missing parents
16/05/09 22:28:26 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 4.0 KB, free 4.0 KB)
16/05/09 22:28:26 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 2.7 KB, free 6.7 KB)
16/05/09 22:28:26 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.0.222:44617 (size: 2.7 KB, free: 511.1 MB)
16/05/09 22:28:26 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006
16/05/09 22:28:26 INFO DAGScheduler: Submitting 10 missing tasks from ResultStage 0 (PythonRDD[1] at reduce at /home/hnykdan1/spark/examples/src/main/python/pi.py:39)
16/05/09 22:28:26 INFO TaskSchedulerImpl: Adding task set 0.0 with 10 tasks
16/05/09 22:28:41 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/05/09 22:28:56 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/05/09 22:29:11 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/05/09 22:29:26 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/05/09 22:29:41 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/05/09 22:29:56 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/05/09 22:30:11 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/05/09 22:30:26 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Upvotes: 4
Views: 10899
Reputation: 1939
It's probably related to the network (security group rules). It's a crude test, but I just got it working by opening the master and workers to all TCP traffic (inbound and outbound).
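A narrower alternative to opening all TCP traffic may be to pin the ports the driver listens on, so that only those need a firewall rule or a tunnel. The following is a minimal sketch, assuming the Spark properties spark.driver.host, spark.driver.port and spark.blockManager.port; the helper name spark_submit_command, the host value and the port numbers are made up for illustration, and other driver-side services (such as the HTTP file server that appears in the question's log) may need ports of their own.

```python
def spark_submit_command(master, script, driver_host, driver_port, block_manager_port):
    """Build a spark-submit argument list with fixed driver-side ports.

    Hypothetical helper: it only assembles the command line, it does not run it.
    """
    conf = {
        "spark.driver.host": driver_host,
        "spark.driver.port": str(driver_port),
        "spark.blockManager.port": str(block_manager_port),
    }
    cmd = ["./bin/spark-submit", "--master", master]
    # Sort the keys so the generated command is stable between runs.
    for key, value in sorted(conf.items()):
        cmd += ["--conf", "{}={}".format(key, value)]
    cmd.append(script)
    return cmd


if __name__ == "__main__":
    # Illustrative values only; pick ports that your network actually allows.
    print(" ".join(spark_submit_command(
        "spark://localhost:7077",
        "examples/src/main/python/pi.py",
        "driver.example.org",
        40000,
        40001,
    )))
```

With the ports fixed, the rule set (or the set of tunnels) can be limited to those ports instead of all TCP traffic.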
Upvotes: 0
Reputation: 2996
Since you checked that you have the resources, the next most likely problem is that the executor cannot connect back to the driver. When submitting a job, the driver starts a server that the executor will connect to in order to download the jar(s).
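One way to test this hypothesis is to try opening a TCP connection from the worker machine to the address and port the driver advertises. The following is a minimal sketch using only the Python standard library; the function name can_connect and the default timeout are arbitrary choices, and the host and port to probe have to be taken from your own driver log.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened, else False."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks and timeouts.
        return False
```

Run on the worker, a call such as can_connect("192.168.0.222", 34508) (the driver address and 'sparkDriver' port from the question's log) returning False would be consistent with the executor being unable to reach the driver.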
Yes, the error message (Initial job has not accepted any resources...) does not look related to a network problem. This is a known issue, discussed for example here:
https://github.com/databricks/spark-knowledgebase/issues/9
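Because the message itself gives no hint, it can help to look at which addresses the driver and the worker report in their logs. In the question's log the driver services are bound to 192.168.0.222 while the executor is granted on 147.32.8.103; a private address is usually not routable from another network, which would fit the connect-back explanation. The sketch below is only a rough heuristic under that assumption: the function name addresses_by_scope is made up, and the two sample lines are copied from the question's log.

```python
import ipaddress
import re

# Matches dotted-quad candidates; validity is checked by ipaddress below.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def addresses_by_scope(log_text):
    """Group the IPv4 addresses found in log_text into (private, public) sets."""
    private, public = set(), set()
    for match in IPV4.findall(log_text):
        try:
            addr = ipaddress.ip_address(match)
        except ValueError:
            continue  # not a valid address, e.g. an octet above 255
        (private if addr.is_private else public).add(str(addr))
    return private, public


if __name__ == "__main__":
    sample = (
        "16/05/09 22:28:24 INFO SparkUI: Started SparkUI at http://192.168.0.222:4040\n"
        "16/05/09 22:28:24 INFO SparkDeploySchedulerBackend: Granted executor ID "
        "app-20160509222824-0011/0 on hostPort 147.32.8.103:54894 with 8 cores, 1024.0 MB RAM\n"
    )
    private, public = addresses_by_scope(sample)
    print("private:", sorted(private))
    print("public:", sorted(public))
```

If the driver's address ends up in the private group and the worker's in the public one, the worker most likely needs some other route to the driver (for example a tunnel in the opposite direction).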
Upvotes: 2