Reputation: 21
We are currently running parallel Spark jobs on an EMR cluster using the HadoopActivity task from Data Pipeline. By default, newer EMR releases set Spark dynamic allocation to true, which increases or reduces the number of executors based on the load. Do we need to set any other properties along with spark-submit, e.g. number of cores, executor memory, etc., or is it best to let the EMR cluster handle it dynamically?
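For reference, a minimal PySpark sketch of the kind of job we submit this way, relying entirely on the EMR defaults (the app name and S3 paths are placeholders):

    from pyspark.sql import SparkSession

    # No executor count, cores or memory set explicitly: with
    # spark.dynamicAllocation.enabled=true (the EMR default) the cluster
    # scales the number of executors up and down based on the load.
    spark = (
        SparkSession.builder
        .appName("emr-hadoopactivity-job")  # placeholder name
        .getOrCreate()
    )

    df = spark.read.parquet("s3://my-bucket/input/")  # placeholder path
    df.groupBy("some_column").count() \
        .write.mode("overwrite").parquet("s3://my-bucket/output/")  # placeholder path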
Upvotes: 1
Views: 600
Reputation: 7732
This always depends on how your application works. I can give you a good example of how we work here. Our Data Scientists generally use the default configuration and it works pretty well, since they run their models from Jupyter. The only thing we set up that may be useful for you is the conf spark.dynamicAllocation.minExecutors, which guarantees at least one or two executors for the job so it is never left without any executor. That is what we do for the Data Scientists.
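A minimal sketch of how that looks, assuming a PySpark job (the property names are real Spark confs, the values are only examples):

    from pyspark.sql import SparkSession

    # Keep dynamic allocation on (the EMR default) but guarantee a floor of
    # executors so the job is never left waiting with zero workers.
    spark = (
        SparkSession.builder
        .appName("dynamic-allocation-with-floor")              # placeholder name
        .config("spark.dynamicAllocation.enabled", "true")
        .config("spark.dynamicAllocation.minExecutors", "2")   # example floor
        .config("spark.dynamicAllocation.maxExecutors", "50")  # optional cap, example value
        .getOrCreate()
    )

The same confs can also be passed on the spark-submit command line, e.g. --conf spark.dynamicAllocation.minExecutors=2, which is probably more convenient from a HadoopActivity step.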
But EMR ships a specific default configuration for each instance type you choose, so in general it is already optimized for the most common workloads. Sometimes, though, you need to change it for your use case: if you need more memory and fewer cores per executor, e.g. for skewed data, it is better to override the defaults, as in the sketch below.
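If you do decide to override the defaults, something like this shows the idea, with bigger executors and fewer cores per executor; the values are purely illustrative and need tuning to your instance type:

    from pyspark.sql import SparkSession

    # Trade cores for memory per executor, e.g. for skewed data where a few
    # tasks need much more memory than the rest. Values are illustrative only.
    spark = (
        SparkSession.builder
        .appName("skewed-data-job")                     # placeholder name
        .config("spark.executor.memory", "8g")          # more memory per executor
        .config("spark.executor.cores", "2")            # fewer concurrent tasks per executor
        .config("spark.executor.memoryOverhead", "1g")  # example overhead, tune as needed
        .getOrCreate()
    )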
Upvotes: 0