Reputation: 167
I have trouble to get my program to run on my spark cluster. I set the cluster up with 1 master and 4 slaves. I started the master, after that, I started the slaves and they show up in the master's web ui.
I then start a small python script to check, if jobs can be executed:
from pyspark import * #SparkContext, SparkConf, spark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql import SQLContext
from files import files
import sys
if __name__ == "__main__":
appName = 'SparkExample'
masterUrl = 'spark://10.0.2.55:7077'
conf = SparkConf()
conf.setAppName(appName)
conf.setMaster(masterUrl)
conf.set("spark.driver.cores","1")
conf.set("spark.driver.memory","1g")
conf.set("spark.executor.cores","1")
conf.set("spark.executor.memory","4g")
conf.set("spark.python.worker.memory","256m")
conf.set("spark.cores.max","4")
conf.set("spark.shuffle.service.enabled","true")
conf.set("spark.dynamicAllocation.enabled","true")
conf.set("spark.dynamicAllocation.maxExecutors","1")
for k,v in conf.getAll():
print(k+":"+v)
spark = SparkSession.builder.config(conf=conf).getOrCreate()
#spark = SparkSession.builder.master(masterUrl).appName(appName).config("spark.executor.memory","1g").getOrCreate()
l = [('Alice', 1)]
spark.createDataFrame(l).collect()
spark.createDataFrame(l, ['name', 'age']).collect()
print("#############")
print("Test finished")
print("#############")
But as soon as I should get something back (line 45: " spark.createDataFrame(l).collect()"), spark seems to hang up. After a while, I see the message:
"WARN TaskSchedulerImpl: Initial job has not accepted any resources: check your cluster UI to ensure that workers are registered and have sufficient resources"
So I check the cluster UI:
worker-20171027105227-xx.x.x.x6-35309 10.0.2.56:35309 ALIVE 4 (0 Used) 6.8 GB (0.0 B Used)
worker-20171027110202-xx.x.x.x0-43433 10.0.2.10:43433 ALIVE 16 (1 Used) 30.4 GB (4.0 GB Used)
worker-20171027110746-xx.x.x.x5-45126 10.0.2.65:45126 ALIVE 8 (0 Used) 30.4 GB (0.0 B Used)
worker-20171027110939-xx.x.x.x4-42477 10.0.2.64:42477 ALIVE 16 (0 Used) 30.4 GB (0.0 B Used)
Looks like there are plenty of resources for the small task I created. I also see the task actually running there. When I click on it, I see, that it was launched on 5 executors and all but one EXITED. When I open the log on one of the exited ones, I see the following error message:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/10/27 16:45:23 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 14443@CODA
17/10/27 16:45:23 INFO SignalUtils: Registered signal handler for TERM
17/10/27 16:45:23 INFO SignalUtils: Registered signal handler for HUP
17/10/27 16:45:23 INFO SignalUtils: Registered signal handler for INT
17/10/27 16:45:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/10/27 16:45:24 INFO SecurityManager: Changing view acls to: root,geissler
17/10/27 16:45:24 INFO SecurityManager: Changing modify acls to: root,geissler
17/10/27 16:45:24 INFO SecurityManager: Changing view acls groups to:
17/10/27 16:45:24 INFO SecurityManager: Changing modify acls groups to:
17/10/27 16:45:24 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root, geissler); groups with view permissions: Set(); users with modify permissions: Set(root, geissler); groups with modify permissions: Set()
17/10/27 16:47:25 ERROR RpcOutboxMessage: Ask timeout before connecting successfully
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:284)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:216)
at scala.util.Try$.apply(Try.scala:192)
at scala.util.Failure.recover(Try.scala:216)
at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at org.spark_project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:136)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at scala.concurrent.Promise$class.complete(Promise.scala:55)
at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:63)
at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:78)
at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:54)
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:106)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153)
at org.apache.spark.rpc.netty.NettyRpcEnv.org$apache$spark$rpc$netty$NettyRpcEnv$$onFailure$1(NettyRpcEnv.scala:205)
at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:239)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply in 120 seconds
... 8 more
This looks as if the slaves cannot provide their results back to the master to me. But I don't know what to do at this point. The slaves are in the same layer of the network as the master, but on different virtual machines (not docker containers). Is there a way how I can check, if they can/cannot reach the master server? Are there any configuration settings I overlooked when setting the cluster up?
Spark version: 2.1.2 (on master, nodes and pyspark)
Upvotes: 1
Views: 478
Reputation: 167
The error here was, that the python script was executed locally. Always launch your spark scripts through spark-submit, never just run it as a normal program. Same is true for Java spark programs.
Upvotes: 0