Surender Raja

Reputation: 3599

When does a Spark on YARN application exit with exitCode: -104?

My Spark application reads 3 files of 7 MB, 40 MB, and 100 MB, applies many transformations, and stores the output in multiple directories.

Spark version: 1.5 (CDH 5.5)

MASTER_URL=yarn-cluster
NUM_EXECUTORS=15
EXECUTOR_MEMORY=4G
EXECUTOR_CORES=6
DRIVER_MEMORY=3G

My Spark job ran for some time, then threw the error message below and restarted from the beginning:

18/03/27 18:59:44 INFO avro.AvroRelation: using snappy for Avro output
18/03/27 18:59:47 ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM
18/03/27 18:59:47 INFO CuratorFrameworkSingleton: Closing ZooKeeper client.

Once it restarted, it ran for some time and then failed with this error:

Application application_1521733534016_7233 failed 2 times due to AM Container for appattempt_1521733534016_7233_000002 exited with exitCode: -104
For more detailed output, check application tracking page:http://entline.com:8088/proxy/application_1521733534016_7233/Then, click on links to logs of each attempt.
Diagnostics: Container [pid=52716,containerID=container_e98_1521733534016_7233_02_000001] is running beyond physical memory limits. Current usage: 3.5 GB of 3.5 GB physical memory used; 4.3 GB of 7.3 GB virtual memory used. Killing container.
Dump of the process-tree for container_e98_1521733534016_7233_02_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 52720 52716 52716 52716 (java) 89736 8182 4495249408 923677 /usr/java/jdk1.7.0_67-cloudera/bin/java -server -Xmx3072m -Djava.io.tmpdir=/apps/hadoop/data04/yarn/nm/usercache/bdbuild/appcache/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001/tmp -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001 -XX:MaxPermSize=256m org.apache.spark.deploy.yarn.ApplicationMaster --class com.sky.ids.dovetail.asrun.etl.DovetailAsRunETLMain --jar file:/apps/projects/dovetail_asrun_etl/jars/EntLine-1.0-SNAPSHOT-jar-with-dependencies.jar --arg --app.conf.path --arg application.conf --arg --run_type --arg AUTO --arg --bus_date --arg 2018-03-27 --arg --code_base_id --arg EntLine-1.0-SNAPSHOT --executor-memory 4096m --executor-cores 6 --properties-file /apps/hadoop/data04/yarn/nm/usercache/bdbuild/appcache/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001/__spark_conf__/__spark_conf__.properties
|- 52716 52714 52716 52716 (bash) 2 0 108998656 389 /bin/bash -c LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/hadoop/../../../CDH-5.5.1-1.cdh5.5.1.p0.11/lib/hadoop/lib/native: /usr/java/jdk1.7.0_67-cloudera/bin/java -server -Xmx3072m -Djava.io.tmpdir=/apps/hadoop/data04/yarn/nm/usercache/bdbuild/appcache/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001/tmp -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001 -XX:MaxPermSize=256m org.apache.spark.deploy.yarn.ApplicationMaster --class 'com.sky.ids.dovetail.asrun.etl.DovetailAsRunETLMain' --jar file:/apps/projects/dovetail_asrun_etl/jars/EntLine-1.0-SNAPSHOT-jar-with-dependencies.jar --arg '--app.conf.path' --arg 'application.conf' --arg '--run_type' --arg 'AUTO' --arg '--bus_date' --arg '2018-03-27' --arg '--code_base_id' --arg 'EntLine-1.0-SNAPSHOT' --executor-memory 4096m --executor-cores 6 --properties-file /apps/hadoop/data04/yarn/nm/usercache/bdbuild/appcache/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001/__spark_conf__/__spark_conf__.properties 1> /var/log/hadoop-yarn/container/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001/stdout 2> /var/log/hadoop-yarn/container/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001/stderr
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Failing this attempt. Failing the application.

As per my CDH configuration:

 Container Memory[Amount of physical memory, in MiB, that can be allocated for containers]

 yarn.nodemanager.resource.memory-mb   50655 MiB 

Please see the containers running on my driver node:

[screenshot: YARN containers running on the driver node]

Why are there so many containers running on one node? I know that container_e98_1521733534016_7880_02_000001 is for my Spark application, but I don't know what the other containers are. Any idea on that? I also see that the physical memory for container_e98_1521733534016_7880_02_000001 is 3584 MB, which is close to 3.5 GB.

What does this error mean? When does it usually occur?

What is the "3.5 GB of 3.5 GB physical memory"? Is it the driver memory?

Could someone help me with this issue?

Upvotes: 5

Views: 8406

Answers (3)

Stanislav G.

Reputation: 21

For Spark, Hive, or any other Hadoop client, you should set a value bigger than 1 GB, for example: export HADOOP_CLIENT_OPTS="-Xmx4096m". This addresses exitCode: -104.
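
A minimal shell sketch of that suggestion (the 4 GB heap size and the hive invocation below are only illustrative, not values taken from the question):

# Raise the client JVM heap before launching a Hadoop/Hive client;
# 4096m is an example value.
export HADOOP_CLIENT_OPTS="-Xmx4096m"
# Any client started from this shell now uses the larger heap, e.g. (hypothetical script name):
hive -f my_query.hql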

Upvotes: 0

abiratsis

Reputation: 7316

A small addition to what @Jacek already wrote to answer the question

why do you get 3.5 GB instead of 3 GB?

is that apart from DRIVER_MEMORY=3G you also need to consider spark.driver.memoryOverhead, which is calculated as MAX(DRIVER_MEMORY * 0.10, 384 MB) = 384 MB, so 3 GB + 384 MB ≈ 3.5 GB.
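
As a rough worked calculation, here is the same sizing as a shell sketch (the driver memory comes from the question; the 512 MB rounding step is an assumption about yarn.scheduler.minimum-allocation-mb on this cluster):

# Driver container sizing under the assumptions above
DRIVER_MB=3072                                                   # DRIVER_MEMORY=3G
OVERHEAD_MB=$(( DRIVER_MB / 10 > 384 ? DRIVER_MB / 10 : 384 ))   # max(10% of driver memory, 384 MB) = 384
REQUEST_MB=$(( DRIVER_MB + OVERHEAD_MB ))                        # 3072 + 384 = 3456 MB requested
INCREMENT_MB=512                                                 # assumed YARN allocation increment
CONTAINER_MB=$(( (REQUEST_MB + INCREMENT_MB - 1) / INCREMENT_MB * INCREMENT_MB ))
echo "$CONTAINER_MB MB"                                          # 3584 MB, i.e. the 3.5 GB in the error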

Upvotes: 3

Jacek Laskowski

Reputation: 74619

container_e98_1521733534016_7233_02_000001 is the first container started, and given MASTER_URL=yarn-cluster it is not only the ApplicationMaster but also the driver of the Spark application.

It appears that the memory setting for the driver, i.e. DRIVER_MEMORY=3G, is too low and you have to bump it up.
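
For example, a spark-submit along these lines would give the driver more room (the 6G and 1024 MB values are only illustrative; spark.yarn.driver.memoryOverhead is the Spark 1.x name of the overhead setting; the class, jar, and application arguments are taken from the command line shown in the question):

spark-submit \
  --master yarn-cluster \
  --class com.sky.ids.dovetail.asrun.etl.DovetailAsRunETLMain \
  --num-executors 15 \
  --executor-cores 6 \
  --executor-memory 4G \
  --driver-memory 6G \
  --conf spark.yarn.driver.memoryOverhead=1024 \
  /apps/projects/dovetail_asrun_etl/jars/EntLine-1.0-SNAPSHOT-jar-with-dependencies.jar \
  --app.conf.path application.conf --run_type AUTO --bus_date 2018-03-27 --code_base_id EntLine-1.0-SNAPSHOT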

Spark on YARN runs two executors by default (see --num-executors), so you would end up with 3 YARN containers: 000001 for the ApplicationMaster (which, in yarn-cluster mode, also runs the driver) and 000002 and 000003 for the two executors.

What is 3.5 GB of 3.5 GB physical memory? Is it driver memory?

Since you use yarn-cluster, the driver, the ApplicationMaster, and container_e98_1521733534016_7233_02_000001 are all the same thing and live in the same JVM. That means the error is about how much memory you assigned to the driver.

My understanding is that you gave DRIVER_MEMORY=3G, which happened to be too little for your processing, and once YARN figured that out it killed the driver (and hence the entire Spark application, as it's not possible to have a Spark application up and running without the driver).

See the document Running Spark on YARN.

Upvotes: 4
