pushpavanthar

Reputation: 869

Tuning a Spark job on YARN

My Spark job is failing with java.lang.OutOfMemoryError: Java heap space. I tried playing around with config params like executor-cores, executor-memory, num-executors, driver-cores, driver-memory, spark.yarn.driver.memoryOverhead, and spark.yarn.executor.memoryOverhead according to Ramzy's answer. Below is my configuration:

--master yarn-cluster --executor-cores 4 --executor-memory 10G --num-executors 30 --driver-cores 4 --driver-memory 16G --queue team_high --conf spark.eventLog.dir=hdfs:///spark-history --conf spark.eventLog.enabled=true --conf spark.yarn.historyServer.address=xxxxxxxxx:xxxx --conf spark.sql.tungsten.enabled=true --conf spark.ui.port=5051 --conf spark.sql.shuffle.partitions=30 --conf spark.yarn.driver.memoryOverhead=1024 --conf spark.yarn.executor.memoryOverhead=1400 --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.sql.orc.filterPushdown=true --conf spark.scheduler.mode=FAIR --conf hive.exec.dynamic.partition=false --conf hive.exec.dynamic.partition.mode=nonstrict --conf mapreduce.fileoutputcommitter.algorithm.version=2 --conf orc.stripe.size=67108864 --conf hive.merge.orcfile.stripe.level=true --conf hive.merge.smallfiles.avgsize=2560000 --conf hive.merge.size.per.task=2560000 --conf spark.driver.extraJavaOptions='-XX:+UseCompressedOops -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps' --conf spark.executor.extraJavaOptions='-XX:+UseCompressedOops -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC'

It works sometimes and fails most of the time with the above-mentioned error. While debugging, I found the GC logs below. Can someone help me understand these logs and help me tune this job?

#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill %p"
#   Executing /bin/sh -c "kill 79911"...
Heap
 PSYoungGen      total 2330112K, used 876951K [0x00000006eab00000, 0x00000007c0000000, 0x00000007c0000000)
  eden space 1165312K, 75% used [0x00000006eab00000,0x0000000720365f50,0x0000000731d00000)
  from space 1164800K, 0% used [0x0000000731d00000,0x0000000731d00000,0x0000000778e80000)
  to   space 1164800K, 0% used [0x0000000778e80000,0x0000000778e80000,0x00000007c0000000)
 ParOldGen       total 6990848K, used 6990706K [0x0000000540000000, 0x00000006eab00000, 0x00000006eab00000)
  object space 6990848K, 99% used [0x0000000540000000,0x00000006eaadc9c0,0x00000006eab00000)
 Metaspace       used 69711K, capacity 70498K, committed 72536K, reserved 1112064K
  class space    used 9950K, capacity 10182K, committed 10624K, reserved 1048576K
End of LogType:stdout

Upvotes: 0

Views: 807

Answers (1)

a9207

Reputation: 354

I have encountered intermittent memory issues while running Spark in a cluster, and I have found that this happens mainly for the following reasons:

1) RDD partitions might simply be too large to be processed. You can decrease the partition size by increasing the number of partitions using the repartition API; this reduces the amount of data each task has to process. Since you have given an executor 10 GB and 4 cores, that executor can run 4 concurrent tasks (partitions), and those 4 tasks share the 10 GB among themselves, which means only about 2.5 GB to process one partition.

val rddWithMorePartitions = rdd.repartition(rdd.getNumPartitions*2)
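
The same idea applies on the SQL side: your submit command sets spark.sql.shuffle.partitions=30, which with 30 executors × 4 cores leaves each shuffle partition fairly large. A minimal sketch, assuming a Spark 1.x SQLContext/HiveContext named sqlContext (120 is only an illustrative value, roughly one partition per available task slot):

// illustrative only: more, smaller shuffle partitions for DataFrame/SQL shuffles
sqlContext.setConf("spark.sql.shuffle.partitions", "120")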

2) If your use case is computation-intensive and you are not doing any caching, you can reduce the memory allocated for storage by tweaking the parameter below.

spark.storage.memoryFraction=0.6 (default)

You can change it as below:

spark.storage.memoryFraction=0.5
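
If you set this programmatically, it has to go on the SparkConf before the SparkContext is created. A minimal sketch, assuming the legacy memory manager (on Spark 1.6+ with unified memory management the corresponding knob is spark.memory.storageFraction):

// sketch: shrink the storage pool when nothing is cached
val conf = new org.apache.spark.SparkConf()
  .set("spark.storage.memoryFraction", "0.5")
val sc = new org.apache.spark.SparkContext(conf)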

3) You should consider increasing the executor memory to something above 25 GB.

--executor-memory 26G
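
Bear in mind that on YARN each executor container requests the executor memory plus spark.yarn.executor.memoryOverhead, and that total must fit under yarn.scheduler.maximum-allocation-mb on your cluster. A rough back-of-the-envelope check, using the 1400 MB overhead from the submit command above:

// sketch: per-executor YARN container request with --executor-memory 26G
val executorMemoryMb = 26 * 1024                              // 26624 MB
val memoryOverheadMb = 1400                                   // spark.yarn.executor.memoryOverhead
val containerRequestMb = executorMemoryMb + memoryOverheadMb  // 28024 MB requested per executor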

Upvotes: 6
