How does Hadoop Reducer get invoked?

Question

Suppose I have a text file like below:

a 1
b 1
c 1
d 1
a 1

Hadoop splits the file and sends the records to 3 Mappers:

Mapper1: (a,1), (b,1)
Mapper2: (c,1)
Mapper3: (d,1), (a,1)

If I have only 2 Reducers, then after shuffle & sort the Reducers' input looks like this:

Reducer1: (a, [1, 1])
Reducer2: (b, [1]), (c, [1]), (d, [1])

Question 1: Does this mean that on Reducer1 the reduce method will be invoked EXACTLY 1 time, and on Reducer2 the reduce method will be invoked EXACTLY 3 times?

Question 2: For my reduce method,

public void reduce(Text key, Iterable values, Context context) throws IOException, InterruptedException

Does the reduce method get invoked only 1 time for every distinct key?

Question 3: And during each invocation, does the values parameter contain ALL records with the same key, even when there are thousands of millions of records?

Thomas Jungblut · Accepted Answer

Question 1: Does this mean that on Reducer1 the reduce method will be invoked EXACTLY 1 time, and on Reducer2 the reduce method will be invoked EXACTLY 3 times?

Yes. Keep in mind that this does not hold true across reducer "attempts": if a reducer fails and is retried, the new attempt invokes reduce again for the same keys, so the total count can vary. Within one JVM, though, your claim holds.
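
To make the counts concrete, here is a minimal plain-Java simulation of the example from the question. It does not use any Hadoop classes. The class name ReduceInvocationSketch and the partition method are hypothetical, and the partitioning rule is hard-coded only to reproduce the distribution assumed in the question; a real job would use its configured partitioner, which may distribute the keys differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of shuffle, sort, and reduce; no Hadoop classes are used.
class ReduceInvocationSketch {

    // Hypothetical partitioner that mirrors the question: "a" -> reducer 0, all other keys -> reducer 1.
    static int partition(String key) {
        return key.equals("a") ? 0 : 1;
    }

    // Assigns each (key, value) record to a reducer and groups it by key.
    // TreeMap keeps the keys of each reducer in sorted order.
    static List<Map<String, List<Integer>>> shuffle(String[][] mapOutput, int numReducers) {
        List<Map<String, List<Integer>>> reducers = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            reducers.add(new TreeMap<>());
        }
        for (String[] record : mapOutput) {
            reducers.get(partition(record[0]))
                    .computeIfAbsent(record[0], k -> new ArrayList<>())
                    .add(Integer.parseInt(record[1]));
        }
        return reducers;
    }

    // One reduce call per distinct key, so the call count equals the number of key groups.
    static int countReduceCalls(Map<String, List<Integer>> reducerInput) {
        int calls = 0;
        for (String key : reducerInput.keySet()) {
            calls++; // stands in for reduce(key, values, context)
        }
        return calls;
    }

    public static void main(String[] args) {
        // Combined output of the three mappers from the question.
        String[][] mapOutput = {{"a", "1"}, {"b", "1"}, {"c", "1"}, {"d", "1"}, {"a", "1"}};
        List<Map<String, List<Integer>>> reducers = shuffle(mapOutput, 2);
        for (int i = 0; i < reducers.size(); i++) {
            System.out.println("Reducer" + (i + 1) + ": " + reducers.get(i)
                    + " -> " + countReduceCalls(reducers.get(i)) + " reduce call(s)");
        }
    }
}
```

Under these assumptions the first reducer sees one key group and the second sees three, matching the 1 and 3 invocations discussed above.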

Does the reduce method get invoked only 1 time for every distinct key?

Yes.

And during each invocation, does the values parameter contain ALL records with the same key, even when there are thousands of millions of records?

Yes, but they are streamed (hence the Iterable) rather than handed over as one in-memory collection. So with millions of records, the values will be read off the local hard disk as you iterate.
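
As a rough illustration of what streaming means for your reduce logic, here is a plain-Java sketch with no Hadoop classes. StreamingValuesSketch and lazyOnes are hypothetical names; lazyOnes merely stands in for the values Iterable by producing each value on demand instead of storing them all. The point is that a reduce-style aggregation can walk the values once and keep only a running total.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Plain-Java sketch of consuming reduce values as a stream; no Hadoop classes are used.
class StreamingValuesSketch {

    // Hypothetical stand-in for the values Iterable: yields `count` ones lazily, never storing them.
    static Iterable<Integer> lazyOnes(long count) {
        return () -> new Iterator<Integer>() {
            private long produced = 0;

            @Override
            public boolean hasNext() {
                return produced < count;
            }

            @Override
            public Integer next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                produced++;
                return 1;
            }
        };
    }

    // Reduce-style aggregation: a single pass that keeps only a running total in memory.
    static long sum(Iterable<Integer> values) {
        long total = 0;
        for (int value : values) {
            total += value;
        }
        return total;
    }

    public static void main(String[] args) {
        // Sums five million values without ever holding them in a collection.
        System.out.println(sum(lazyOnes(5_000_000L)));
    }
}
```

If your reduce method instead copies every value into a list before processing, it gives up this benefit and can run out of memory on very large key groups.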
