David Nemeskey
David Nemeskey

Reputation: 640

How to set the number of parallel reducers on EMR?

I am running a job on EMR with mrjob; I am using AMI version 2.4.7 and Hadoop version 1.0.3.

I want to specify the number of reducers for a job, because I want to provide a higher parallellism to the next one. Reading the answers to the other questions on this site, I gathered that I should set these parameters, and so I did: mapred.reduce.tasks=576 mapred.tasktracker.reduce.tasks.maximum=24

However, it seems like the second option is not picked up: both the EMR and the Hadoop interfaces report that there are 576 reduce tasks to run, but the capacity of the cluster remains at 72 (r3.8xlarge instances).

I even see that the option is set in var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_XXX/job.xml:<property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>24</value></property>. Still, only the default number (9) of actual reducers are running at the same time.

Why is the option not picked up by EMR? Or is there a different way to force a higher number of reducers on an instance?

Upvotes: 2

Views: 2441

Answers (1)

ChristopherB
ChristopherB

Reputation: 2068

With Hadoop 1, the map and reduce slots per node are set at the daemon level and thus require a restart of the TaskTracker daemons if the value is changed.

On EMR, the default number of slots per instance type can be found at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/HadoopMemoryDefault_H1.0.3.html.

In order to change these default values you will need to use a bootstrap action like configure-hadoop to modify the mapred.tasktracker.reduce.tasks.maximum on the cluster before Hadoop daemons start. See http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html#PredefinedbootstrapActions_ConfigureHadoop for more details.

Example (will need to be modified to match whatever interface is being used to create the cluster):

s3://<region>.elasticmapreduce/bootstrap-actions/configure-hadoop -m mapred.tasktracker.reduce.tasks.maximum=24

Please note if changing the number of slots per node be sure to adjust mapred.child.java.opts to provide an upper memory amount that is reasonable for the amount of memory available.

Upvotes: 2

Related Questions