AKSHAY SHINGOTE

Reputation: 407

How Size of Driver Memory & Executor memory is allocated when using Spark on Amazon EMR

I was running my Spark applications with Spark 2.0.2 on an AWS EMR 5.2 cluster of 10 m4.2xlarge nodes, with the property maximizeResourceAllocation=true. In spark-defaults.conf I saw the following properties:

spark.executor.instances         10
spark.executor.cores             16
spark.driver.memory              22342M
spark.executor.memory            21527M
spark.default.parallelism        320

In yarn-site.xml I saw yarn.nodemanager.resource.memory-mb=24576 (24 GB). I only understand that spark.executor.instances is set to 10 because I am using a 10-node cluster. But can anyone explain how the other properties were set, i.e. how the driver memory and the executor memory were calculated? Also, I used the property maximizeResourceAllocation=true. How does this affect the memory?

Upvotes: 2

Views: 9559

Answers (1)

FaigB

Reputation: 2281

I suggest the book Spark in Action. In brief, executors are containers that run the tasks delivered to them by the driver. One node in a cluster can launch several executors, depending on resource allocation. CPU allocation enables running tasks in parallel, so it is better to have more cores per executor: more CPU cores means more task slots. Memory for executors should be allocated sensibly so that it fits within the YARN container memory: YARN container memory >= executor memory + executor memory overhead.
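
For illustration, a rough sketch of that check with the numbers from the question, assuming the Spark 2.x default overhead for YARN (spark.yarn.executor.memoryOverhead = max(384 MB, 10% of executor memory)); this is my assumption of how the numbers line up, not something stated in the question:

// Sketch: does executor memory + overhead fit in the YARN container limit?
// executorMemoryMb and yarnLimitMb are taken from the question;
// the overhead formula assumes Spark 2.x defaults.
object MemoryCheck extends App {
  val executorMemoryMb = 21527                                      // spark.executor.memory
  val overheadMb = math.max(384, (executorMemoryMb * 0.10).toInt)   // ~2152 MB
  val containerMb = executorMemoryMb + overheadMb                   // ~23679 MB per executor
  val yarnLimitMb = 24576                                           // yarn.nodemanager.resource.memory-mb
  println(s"container=$containerMb MB, YARN limit=$yarnLimitMb MB, fits=${containerMb <= yarnLimitMb}")
}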

Spark reserves parts of that memory for cached data storage and for temporary shuffle data. These fractions of the heap are set with the parameters spark.storage.memoryFraction (default 0.6) and spark.shuffle.memoryFraction (default 0.2). Because these parts of the heap can grow before Spark can measure and limit them, two additional safety parameters must be set: spark.storage.safetyFraction (default 0.9) and spark.shuffle.safetyFraction (default 0.8). The safety parameters lower the memory fraction by the amount specified. The actual part of the heap used for storage by default is 0.6 × 0.9 (storage memory fraction times its safety fraction), which equals 54%. Similarly, the part of the heap used for shuffle data is 0.2 × 0.8 (shuffle memory fraction times its safety fraction), which equals 16%. You then have 30% of the heap left for other Java objects and resources needed to run tasks, but you should count on only 20% of it.
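
As a worked example, here is how those default fractions would carve up the 21527 MB executor heap from the question (note these parameters belong to the legacy memory manager; newer Spark versions use unified memory management by default, so treat this as an illustration of the arithmetic above):

// Sketch: splitting an executor heap by the legacy default fractions quoted above.
object HeapFractions extends App {
  val heapMb    = 21527.0                          // spark.executor.memory from the question
  val storageMb = heapMb * 0.6 * 0.9               // storage * safety ≈ 54% ≈ 11625 MB
  val shuffleMb = heapMb * 0.2 * 0.8               // shuffle * safety ≈ 16% ≈ 3444 MB
  val otherMb   = heapMb - storageMb - shuffleMb   // ≈ 30% left for other objects and task execution
  println(f"storage=$storageMb%.0f MB, shuffle=$shuffleMb%.0f MB, other=$otherMb%.0f MB")
}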

The driver orchestrates stages and tasks among the executors. Results from executors are returned to the driver, so driver memory should also be sized to handle all the data that may be gathered together from all the executors.
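
For example, calling collect() materializes every row on the driver, so the driver heap must hold the whole result, whereas take(n) only brings back a handful of rows. A minimal sketch (the application name and input path are hypothetical placeholders, not from the question):

// Sketch: why spark.driver.memory matters when results are pulled back to the driver.
import org.apache.spark.sql.SparkSession

object DriverMemoryDemo extends App {
  val spark = SparkSession.builder.appName("driver-memory-demo").getOrCreate()
  val df = spark.read.parquet("s3://example-bucket/some-data")  // hypothetical input
  // df.collect() would pull every row to the driver and could exceed spark.driver.memory
  val sample = df.take(10)                                      // only a few rows reach the driver
  sample.foreach(println)
  spark.stop()
}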

Upvotes: 6
