Reputation: 295
I am new to Dataproc and PySpark. I created a cluster with the below configuration:
gcloud beta dataproc clusters create $CLUSTER_NAME \
--zone $ZONE \
--region $REGION \
--master-machine-type n1-standard-4 \
--master-boot-disk-size 500 \
--worker-machine-type n1-standard-4 \
--worker-boot-disk-size 500 \
--num-workers 3 \
--bucket $GCS_BUCKET \
--image-version 1.4-ubuntu18 \
--optional-components=ANACONDA,JUPYTER \
--subnet=default \
--enable-component-gateway \
--scopes 'https://www.googleapis.com/auth/cloud-platform' \
--properties ${PROPERTIES}
Here are the property settings I am currently using, based on what I found on the internet.
PROPERTIES="\
spark:spark.executor.cores=2,\
spark:spark.executor.memory=8g,\
spark:spark.executor.memoryOverhead=2g,\
spark:spark.driver.memory=6g,\
spark:spark.driver.maxResultSize=6g,\
spark:spark.kryoserializer.buffer=128m,\
spark:spark.kryoserializer.buffer.max=1024m,\
spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,\
spark:spark.default.parallelism=512,\
spark:spark.rdd.compress=true,\
spark:spark.network.timeout=10000000,\
spark:spark.executor.heartbeatInterval=10000000,\
spark:spark.rpc.message.maxSize=256,\
spark:spark.io.compression.codec=snappy,\
spark:spark.shuffle.service.enabled=true,\
spark:spark.sql.shuffle.partitions=256,\
spark:spark.sql.files.ignoreCorruptFiles=true,\
yarn:yarn.nodemanager.resource.cpu-vcores=8,\
yarn:yarn.scheduler.minimum-allocation-vcores=2,\
yarn:yarn.scheduler.maximum-allocation-vcores=4,\
yarn:yarn.nodemanager.vmem-check-enabled=false,\
capacity-scheduler:yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
"
I want to understand whether these are the right property settings for my cluster, and if not, how to assign the most suitable values to these properties (especially the executor cores, memory, and memoryOverhead) so that my PySpark jobs run as efficiently as possible. I am also asking because I am facing this error:
Container exited with a non-zero exit code 143. Killed by external signal
Upvotes: 2
Views: 2830
Reputation: 236
It is important here to understand the configuration and limitations of the machines you are using, and how memory is allocated to the Spark components.
n1-standard-4 is a 4-core machine with 15 GB of RAM. By default, 80% of a machine's memory is allocated to the YARN NodeManager. Since you are not setting it explicitly, in this case that will be about 12 GB.
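For reference, the arithmetic behind that figure, plus the underlying YARN property if you ever want to set it explicitly (the exact value Dataproc picks may differ slightly; this just mirrors the 80% estimate above):
# n1-standard-4: 4 vCPUs, 15 GB (~15360 MB) RAM per machine
# NodeManager allocation: ~0.8 * 15360 MB ≈ 12288 MB (~12 GB)
# Equivalent explicit setting in Dataproc --properties form:
# yarn:yarn.nodemanager.resource.memory-mb=12288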
Spark Executor and Driver run in the containers allocated by YARN.
Total memory allocated to a Spark executor is the sum of spark.executor.memory and spark.executor.memoryOverhead, which in this case is 10 GB. I would advise you to allocate more memory to the executor heap than to memoryOverhead, since the former is used for running your tasks while the latter covers off-heap overhead (VM overheads, interned strings and other native allocations). By default, spark.executor.memoryOverhead is max(384 MB, 0.10 * spark.executor.memory).
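Plugging in your current numbers (illustrative arithmetic only):
# spark.executor.memory          = 8 GB
# spark.executor.memoryOverhead  = 2 GB (explicit; the default would be max(384 MB, 0.10 * 8192 MB) ≈ 819 MB)
# Total container size requested = 8 GB + 2 GB = 10 GB per executor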
In this case you can have only one executor per machine: a 10 GB container against the ~12 GB the NodeManager can allocate. With this configuration you are also underutilizing the cores, since each executor uses only 2 of the 4 available. It is advisable to leave 1 core per machine for the OS and other daemon processes, so it might help to change spark.executor.cores to 3 here.
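As a sketch only, not a definitive tuning, the relevant part of your PROPERTIES string rebalanced along these lines would look roughly like this (the 10g/1g split is an illustrative assumption):
# One ~11 GB executor (10 GB heap + 1 GB overhead) per worker, fitting the ~12 GB
# NodeManager allocation and using 3 of the 4 cores:
PROPERTIES="\
spark:spark.executor.cores=3,\
spark:spark.executor.memory=10g,\
spark:spark.executor.memoryOverhead=1g\
"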
In general it is recommended to use default memory configurations, unless you have a very good understanding of all the properties you are modifying. Based on the performance of your application under default settings, you may tweak other properties. Also consider changing to a different machine type based on the memory requirements of your application.
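If you would rather experiment without recreating the cluster each time, note that Spark properties can also be overridden per job at submit time; a rough example (your_job.py is just a placeholder for your script):
gcloud dataproc jobs submit pyspark your_job.py \
    --cluster $CLUSTER_NAME \
    --region $REGION \
    --properties spark.executor.cores=3,spark.executor.memory=10g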
References:
1. https://mapr.com/blog/resource-allocation-configuration-spark-yarn/
2. https://sujithjay.com/spark/with-yarn
Upvotes: 5