Reputation: 1103
I was running an application on AWS EMR-Spark. Here is the spark-submit job:
Arguments : spark-submit --deploy-mode cluster --class com.amazon.JavaSparkPi s3://spark-config-test/SWALiveOrderModelSpark-1.0.assembly.jar s3://spark-config-test/2017-08-08
AWS EMR uses YARN for resource management. I was looking at the metrics (screenshot below) and have a doubt regarding the YARN 'container' metrics.
Here, the number of containers allocated is shown as 2. However, I was using 4 nodes (3 slaves + 1 master), each with 8 CPU cores. So how are only 2 containers allocated?
Upvotes: 1
Views: 3039
Reputation: 13154
There are a couple of things you need to do. First of all, you need to set the following configuration in capacity-scheduler.xml:
"yarn.scheduler.capacity.resource-calculator":"org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
otherwise YARN will not use all the cores you specify. Secondly, you need to actually specify the number of executors you need, as well as the number of cores per executor and the amount of memory allocated to the executors (and possibly to the driver as well, if you either have many shuffle partitions or if you collect data to the driver).
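For example, a sketch of the submit command from the question with these settings made explicit (the numbers are purely illustrative and need to be sized to your instance types, not recommended values for this cluster):

spark-submit --deploy-mode cluster \
  --class com.amazon.JavaSparkPi \
  --num-executors 6 \
  --executor-cores 4 \
  --executor-memory 10g \
  --driver-memory 4g \
  s3://spark-config-test/SWALiveOrderModelSpark-1.0.assembly.jar s3://spark-config-test/2017-08-08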
YARN is designed to manage clusters running many different jobs at the same time, so by default it will not assign all resources to a single job unless you force it to with the above-mentioned setting. Furthermore, the default settings for Spark are also not sufficient for most jobs, and you need to set them explicitly. Please have a read through this blog post to get a better understanding of how to tune Spark settings for optimal performance.
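If you prefer not to pass these flags on every submit, the same values can be set cluster-wide in spark-defaults.conf (on EMR, the spark-defaults configuration classification); a minimal sketch with the same illustrative, untuned numbers as above:

spark.executor.instances  6
spark.executor.cores      4
spark.executor.memory     10g
spark.driver.memory       4g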
Upvotes: 2