Reputation: 507
I'm running Spark 1.5.1 in standalone (client) mode using PySpark. I'm trying to start a job that seems to be memory-heavy (on the Python side, that is, so that memory should not fall under the executor-memory setting). I'm testing on a machine with 96 cores and 128 GB of RAM.
I have a master and a worker running, started with the start-all.sh script in sbin/.
These are the config files I use, in conf/.
spark-defaults.conf:
spark.eventLog.enabled true
spark.eventLog.dir /home/kv/Spark/spark-1.5.1-bin-hadoop2.6/logs
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.dynamicAllocation.enabled false
spark.deploy.defaultCores 40
spark-env.sh:
SPARK_MASTER_IP='5.153.14.30' # Will become deprecated
SPARK_MASTER_HOST='5.153.14.30'
SPARK_MASTER_PORT=7079
SPARK_MASTER_WEBUI_PORT=8080
SPARK_WORKER_WEBUI_PORT=8081
I'm starting my script using the following command:
export SPARK_MASTER=spark://5.153.14.30:7079 #"local[*]"
spark-submit \
--master ${SPARK_MASTER} \
--num-executors 1 \
--driver-memory 20g \
--executor-memory 30g \
--executor-cores 40 \
--py-files code.zip \
<script>
Now, I'm noticing behaviour that I don't understand:

- I expect there to be 1 executor. However, 2 executors are started.
- When setting executor-cores to over 40, my job does not get started because of not enough resources. I expect that this is because of the spark.deploy.defaultCores 40 setting in my spark-defaults.conf. But isn't that just a fallback for when my application does not provide a maximum number of cores? I should be able to override that, right? I would expect something like the submit command below to work.
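This is the kind of thing I would expect to work (assuming spark.cores.max is the per-application override for spark.deploy.defaultCores; that part is my reading of the docs, and the values are just illustrative):

spark-submit \
--master ${SPARK_MASTER} \
--conf spark.cores.max=80 \
--driver-memory 20g \
--executor-memory 30g \
--executor-cores 40 \
--py-files code.zip \
<script>

Extract from the error messages I get: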
Lost task 1532.0 in stage 2.0 (TID 5252, 5.153.14.30): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:203)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:139)
... 15 more
[...]
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 111 in stage 2.0 failed 4 times, most recent failure: Lost task 111.3 in stage 2.0 (TID 5673, 5.153.14.30): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
Upvotes: 2
Views: 1945
Reputation: 73366
I expect there to be 1 executor. However, 2 executors are started
I think one of the two "executors" you see is actually the driver.
So: one master, one slave (2 nodes in total).
You can add these configuration flags to your spark-submit command:
--conf spark.executor.cores=8 <-- sets it to 8; you probably want fewer
--conf spark.driver.cores=8 <-- the same, but for the driver only
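For example, merged into your existing command (a sketch; the core counts are just the examples above, pick what fits your job):

spark-submit \
--master ${SPARK_MASTER} \
--conf spark.executor.cores=8 \
--conf spark.driver.cores=8 \
--driver-memory 20g \
--executor-memory 30g \
--py-files code.zip \
<script>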
my job does not get started because of not enough resources.
I believe the container gets killed. You see, you ask for too many resources, so every container/task/core tries to take as much memory as possible, and your system simply can't give more.
The container might be exceeding its memory limits (you should be able to see more in the logs to be certain, though).
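As a back-of-envelope check with your settings (assuming both started processes are full executors): 20 GB (driver) + 2 * 30 GB (executors) = 80 GB of JVM heap alone, and the Python worker processes doing your actual work live outside executor-memory, so the total footprint can plausibly run past the machine's 128 GB.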
Upvotes: 0
Reputation: 1044
Check or set the value of spark.executor.instances. The default is 2, which may explain why you get 2 executors.
Since your server has 96 cores and you set spark.deploy.defaultCores to 40, you only have room for 2 executors, since 2*40 = 80. The remaining 16 cores are insufficient for another executor, and the driver also needs CPU cores.
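A sketch of pinning it down explicitly (note my assumption here: spark.executor.instances is primarily honored on YARN, so on a standalone master you may need to steer the executor count through spark.cores.max and spark.executor.cores instead):

spark-submit \
--master ${SPARK_MASTER} \
--conf spark.executor.instances=1 \
--conf spark.cores.max=40 \
--executor-cores 40 \
--py-files code.zip \
<script>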
Upvotes: 1