Saurabh Rana

Reputation: 167

Deciding on the optimal number of reducers for fastest processing in a Hadoop MapReduce program

The default number of reducers is 1. The partitioner makes sure that the same keys from multiple mappers go to the same reducer, but that does not mean the number of reducers will be equal to the number of partitions. From the driver one can specify the number of reducers using JobConf's conf.setNumReduceTasks(int num), or as mapred.reduce.tasks on the command line. If only the mappers are required, we can set this to 0.
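For context, here is a minimal driver sketch showing where setNumReduceTasks(int) is called (the class name is hypothetical and 10 is an arbitrary placeholder count); the same value can be passed on the command line as -D mapred.reduce.tasks=N (mapreduce.job.reduces in the newer API):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class ReducerCountDemo {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(ReducerCountDemo.class);
            conf.setJobName("reducer-count-demo");
            // Mapper/reducer classes are omitted here; the old mapred API
            // falls back to the identity mapper and identity reducer, so the
            // job runs as-is and only the reducer count is of interest.
            conf.setNumReduceTasks(10); // explicit reducer count; 0 = map-only job
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);
        }
    }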

I have read regarding setting the number of reducers that:

  1. The number of reducers can be between 0.95 or 1.75 multiplied by (no. of nodes) * (no. of maximum containers per node); see the worked example after this list. I also read in the link below that increasing the number of reducers can have overhead:

Number of reducers in hadoop

  2. Also, I read in the link below that the number of reducers is best set to the number of reduce slots in the cluster (minus a few to allow for failures):

What determines the number of mappers/reducers to use given a specified set of data
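To make the range in point 1 concrete (the cluster size here is hypothetical): on a cluster of 10 nodes with 8 maximum containers each, the guideline gives 0.95 * 10 * 8 = 76 reducers at the low end and 1.75 * 10 * 8 = 140 at the high end. With the lower factor, all reducers can launch immediately in a single wave; with the higher factor, faster nodes finish their first wave and launch a second, which balances load better at the cost of extra task-startup overhead.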

Based on the range specified in 1 and on the guidance in 2, how does one decide on the optimal number for fastest processing?

Thanks.

Upvotes: 0

Views: 237

Answers (1)

Attersson

Reputation: 4866

I want to know the approach for this in general.

This question can only have an empirical answer. Quoting from an answer in this Q&A:

By default, on 1 GB of data one reducer would be used. [...] Similarly, if your data is 10 GB, 10 reducers would be used.

The defaults already are the rule of thumb. You can further tweak the default number by running empirical tests and seeing how performance changes. That is, currently, all there is to it.

Upvotes: 1
