x_coder

Reputation: 73

number of mapper and reducer tasks in MapReduce

If I set the number of reduce tasks to something like 100 and, when I run the job, the number of reduce tasks exceeds that (as per my understanding, the number of reduce tasks depends on the key-value pairs we get from the mapper; suppose I emit (1,abc) and (2,bcd) as key-value pairs in the mapper, then the number of reduce tasks will be 2), how will MapReduce handle it?

Upvotes: 0

Views: 875

Answers (2)

Prabhu Moorthy

Reputation: 177

as per my understanding, the number of reduce tasks depends on the key-value pairs we get from the mapper

Your understanding seems to be wrong. The number of reduce tasks does not depend on the key-value pairs emitted by the mapper. In a MapReduce job, the number of reducers is configurable on a per-job basis and is set in the driver class.

For example, if we need 2 reducers for our job, we set it in the driver class of our MapReduce job as below:

job.setNumReduceTasks(2);

In the Hadoop: The Definitive Guide book, Tom White states that setting the reducer count is more of an art than a science.

So we have to decide how many reducers we need for our job. For your example, if the intermediate mapper output is (1,abc) and (2,bcd) and you have not set the number of reducers in the driver class, then MapReduce runs only 1 reducer by default: both key-value pairs will be processed by that single reducer, and you will get a single output file in the specified output directory.
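For completeness, here is a minimal, hypothetical driver sketch showing where that call goes. ReducerCountDriver is an assumed name, and the identity Mapper/Reducer base classes stand in for your own job's classes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReducerCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "reducer count demo");
        job.setJarByClass(ReducerCountDriver.class);

        // The base Mapper/Reducer classes act as identity functions;
        // replace them with your own Mapper and Reducer implementations.
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);

        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Fix the number of reduce tasks for this job. With 2 reducers
        // the job produces at most 2 output files in the output directory:
        // part-r-00000 and part-r-00001.
        job.setNumReduceTasks(2);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

If setNumReduceTasks is never called, the job falls back to the default of a single reducer, which is why the two pairs above would end up in one output file.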

Upvotes: 1

YoungHobbit

Reputation: 13402

The default number of reducers in a MapReduce job is 1, irrespective of the number of (key,value) pairs.

If you set the number of reducers for a MapReduce job, the number of reducers will not exceed the defined value, irrespective of the number of distinct (key,value) pairs.

Once the mapper tasks are completed, the output is processed by the Partitioner, which divides the data among the reducers. The default partitioner in Hadoop is HashPartitioner, which partitions the data based on the hash value of the key. Its getPartition method takes key.hashCode() & Integer.MAX_VALUE and finds the modulus using the number of reduce tasks.
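For reference, the stock HashPartitioner is essentially the following (a sketch matching the getPartition logic just described):

import org.apache.hadoop.mapreduce.Partitioner;

public class HashPartitioner<K, V> extends Partitioner<K, V> {

  /** Route each record to a reducer based on the hash of its key. */
  @Override
  public int getPartition(K key, V value, int numReduceTasks) {
    // The bitmask keeps the hash non-negative; the modulus maps it
    // into the range [0, numReduceTasks), i.e. one of the reducers.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

For the example in the question, assuming IntWritable keys and 100 configured reducers, keys 1 and 2 would land in partitions 1 and 2, and the remaining 98 reducers would simply produce empty output files.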

So the number of reducers will never exceed what you have defined in the driver class: the modulus guarantees a partition number between 0 and numReduceTasks - 1.

Upvotes: 0
