Tracy

Reputation: 295

Executor heartbeat timed out after 125009 ms while executing Spark jobs on a Dataproc cluster

Below is how I'm creating my Dataproc cluster. While setting the properties I'm accounting for the network timeout by assigning spark.network.timeout=3600s, but despite that the executor's heartbeat timed out after 125009 ms. Why is this happening, and what can be done to avoid it?

default_parallelism=512

PROPERTIES="\
spark:spark.executor.cores=2,\
spark:spark.executor.memory=8g,\
spark:spark.executor.memoryOverhead=2g,\
spark:spark.driver.memory=6g,\
spark:spark.driver.maxResultSize=6g,\
spark:spark.kryoserializer.buffer=128m,\
spark:spark.kryoserializer.buffer.max=1024m,\
spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,\
spark:spark.default.parallelism=${default_parallelism},\
spark:spark.rdd.compress=true,\
spark:spark.network.timeout=3600s,\
spark:spark.rpc.message.maxSize=256,\
spark:spark.io.compression.codec=snappy,\
spark:spark.shuffle.service.enabled=true,\
spark:spark.sql.shuffle.partitions=256,\
spark:spark.sql.files.ignoreCorruptFiles=true,\
yarn:yarn.nodemanager.resource.cpu-vcores=8,\
yarn:yarn.scheduler.minimum-allocation-vcores=2,\
yarn:yarn.scheduler.maximum-allocation-vcores=4,\
yarn:yarn.nodemanager.vmem-check-enabled=false,\
capacity-scheduler:yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"

gcloud beta dataproc clusters create $CLUSTER_NAME  \
    --zone $ZONE \
    --region $REGION \
    --master-machine-type n1-standard-4 \
    --master-boot-disk-size 500 \
    --worker-machine-type n1-standard-4 \
    --worker-boot-disk-size 500 \
    --num-workers 3 \
    --bucket $GCS_BUCKET \
    --image-version 1.4-ubuntu18 \
    --optional-components=ANACONDA,JUPYTER \
    --subnet=default \
    --enable-component-gateway \
    --properties "${PROPERTIES}" \
    --scopes 'https://www.googleapis.com/auth/cloud-platform'

Below is the error I'm getting:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 11, cluster-abc-z-2.c.project_name.internal, executor 5): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 125009 ms

Upvotes: 2

Views: 890

Answers (1)

Gaurangi Saxena

Reputation: 236

You should be setting spark.executor.heartbeatInterval; its default value is 10s. This is the interval at which each executor sends heartbeats to the driver, letting the driver know the executor is still alive, and per the Spark configuration docs it should be significantly less than spark.network.timeout (which you have already raised to 3600s).

https://spark.apache.org/docs/latest/configuration.html
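For example (a sketch, assuming a 60s interval gives your tasks enough headroom; the exact value is something to tune for your workload, not a prescription), add the property to the PROPERTIES string from the question, keeping it well below spark.network.timeout:

PROPERTIES="\
spark:spark.executor.heartbeatInterval=60s,\
spark:spark.executor.cores=2,\
... (rest of the properties unchanged)

The same setting can also be applied per job at submission time instead of cluster-wide; here my_job.py is a placeholder for your application, and note that the spark: prefix is only used for cluster properties, not job properties:

gcloud dataproc jobs submit pyspark my_job.py \
    --cluster $CLUSTER_NAME \
    --region $REGION \
    --properties spark.executor.heartbeatInterval=60s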

Upvotes: 4
