kuang

Reputation: 727

How does Hadoop Reducer get invoked?

Suppose I have a text file like below:

a 1
b 1
c 1
d 1
a 1

Hadoop splits the file and sends the records to 3 Mappers:

Mapper1: (a,1), (b,1)
Mapper2: (c,1)
Mapper3: (d,1), (a,1)

If I have only 2 Reducers, then after shuffle & sort the Reducers' input looks like this:

Reducer1: (a, [1, 1])
Reducer2: (b, [1]), (c, [1]), (d, [1])

Question 1: Does this mean that on Reducer1 the reduce method will be invoked EXACTLY 1 time, and on Reducer2 it will be invoked EXACTLY 3 times?

Question 2: For my reduce method,

public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException

Does the reduce method get invoked only once for each distinct key?

Question 3: And during each invocation, does the values parameter contain ALL records with the same key, even when there are thousands of millions of records?

Upvotes: 1

Views: 104

Answers (1)

Thomas Jungblut

Reputation: 20969

Question 1: Does this mean that on Reducer1 the reduce method will be invoked EXACTLY 1 time, and on Reducer2 it will be invoked EXACTLY 3 times?

Yes. Keep in mind that this does not hold across reducer "attempts": if a reducer fails and is retried, reduce is invoked again for the same keys, so the overall count can vary. Within a single attempt (one JVM), your claim holds.

Does the reduce method get invoked only once for each distinct key?

Yes.
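
To make the "one call per distinct key" behaviour concrete, here is a minimal plain-Java sketch that simulates the grouping from the question. It does not use the Hadoop API. The class name, `sampleMapOutput`, `partition`, `shuffleAndSort` and `countReduceCalls` are all hypothetical names made up for illustration, and the partitioner is hard-coded to reproduce the split in the question (Hadoop's default partitioner may assign keys differently).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ReduceInvocationDemo {

    /** Mapper output from the question as {key, value} pairs, in emission order. */
    static List<String[]> sampleMapOutput() {
        return Arrays.asList(
            new String[] {"a", "1"}, new String[] {"b", "1"},   // Mapper1
            new String[] {"c", "1"},                            // Mapper2
            new String[] {"d", "1"}, new String[] {"a", "1"});  // Mapper3
    }

    /**
     * Hypothetical partitioner for exactly 2 reducers, chosen only to match
     * the question: key "a" goes to reducer 0, every other key to reducer 1.
     */
    static int partition(String key) {
        return key.equals("a") ? 0 : 1;
    }

    /** Groups the values per reducer and per key; TreeMap keeps keys sorted. */
    static List<Map<String, List<Integer>>> shuffleAndSort(List<String[]> mapOutput,
                                                           int numReducers) {
        List<Map<String, List<Integer>>> inputs = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            inputs.add(new TreeMap<>());
        }
        for (String[] pair : mapOutput) {
            inputs.get(partition(pair[0]))
                  .computeIfAbsent(pair[0], k -> new ArrayList<>())
                  .add(Integer.parseInt(pair[1]));
        }
        return inputs;
    }

    /** Stand-in for the user's reduce method: sums the values of one key. */
    static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** Calls reduce once per distinct key and counts the calls per reducer. */
    static int[] countReduceCalls(List<String[]> mapOutput, int numReducers) {
        int[] calls = new int[numReducers];
        List<Map<String, List<Integer>>> inputs = shuffleAndSort(mapOutput, numReducers);
        for (int r = 0; r < numReducers; r++) {
            for (Map.Entry<String, List<Integer>> group : inputs.get(r).entrySet()) {
                reduce(group.getKey(), group.getValue());
                calls[r]++;
            }
        }
        return calls;
    }

    public static void main(String[] args) {
        // Prints [1, 3]: one call on Reducer1, three calls on Reducer2.
        System.out.println(Arrays.toString(countReduceCalls(sampleMapOutput(), 2)));
    }
}
```

The counts depend only on how many distinct keys land on each reducer, not on how many values each key has.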

And during each invocation, does the values parameter contain ALL records with the same key, even when there are thousands of millions of records?

Yes, but they are streamed (hence the Iterable) rather than held in memory all at once. So with millions of records, the values are read off the local hard disk as you iterate.
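
As a rough illustration of why an Iterable suits this, the plain-Java sketch below sums values that are produced lazily, one at a time, so memory use stays constant no matter how many values a key has. It is not Hadoop code: `StreamingValuesDemo`, `streamedOnes` and `sum` are hypothetical names, and the generated values merely stand in for records read from disk.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

class StreamingValuesDemo {

    /**
     * A lazy Iterable that yields the value 1 exactly `count` times.
     * Nothing is stored: each value is produced when next() is called,
     * which stands in for values being read from disk on demand.
     */
    static Iterable<Integer> streamedOnes(long count) {
        return () -> new Iterator<Integer>() {
            private long produced = 0;

            @Override
            public boolean hasNext() {
                return produced < count;
            }

            @Override
            public Integer next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                produced++;
                return 1;
            }
        };
    }

    /** Reduce-style aggregation: a single pass that keeps only a running total. */
    static long sum(Iterable<Integer> values) {
        long total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        // Prints 5000000 without ever materialising five million values.
        System.out.println(sum(streamedOnes(5_000_000L)));
    }
}
```

Copying the values into a list inside reduce would give up this benefit, since the whole group would then have to fit in memory.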

Upvotes: 2
