Reputation: 727
Suppose I have a text file like below:
a 1
b 1
c 1
d 1
a 1
Hadoop splits the file and sends the records to 3 Mappers:
Mapper1: (a,1), (b,1)
Mapper2: (c,1)
Mapper3: (d,1), (a,1)
If I have only 2 Reducers, then after shuffle & sort the Reducers' input looks like this:
Reducer1: (a, [1, 1])
Reducer2: (b, [1]), (c, [1]), (d, [1])
Question 1: Does this mean that on Reducer1 the reduce method will be invoked EXACTLY 1 time, and on Reducer2 the reduce method will be invoked EXACTLY 3 times?
Question 2: For my reduce method,
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
does the reduce method get invoked only 1 time for every distinct key?
Question 3: And during each invocation, does the values parameter contain ALL records with the same key, even when there are thousands of millions of records?
Upvotes: 1
Views: 104
Reputation: 20969
Question 1: Does this mean that on Reducer1, reduce method will be invoked EXACTLY 1 time and on Reducer2, reduce method will be invoked EXACTLY 3 times?
Yes. Keep in mind that this does not hold true across reducer "attempts": if a reduce task fails and is retried, the new attempt invokes reduce again for the same keys, so the total count across attempts can be higher. But within one JVM your claim holds.
Does the reduce method get invoked only 1 time for every distinct key?
Yes.
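To make the "one reduce call per distinct key" behaviour concrete, here is a minimal plain-Java sketch that mimics what one reduce task does with the pairs routed to it. It does not use the Hadoop API: ReduceCallDemo and runReducer are hypothetical names invented for illustration, and the grouping is done with an in-memory TreeMap purely for simplicity (the real framework merges sorted data instead).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ReduceCallDemo {
    // Hypothetical stand-in for one reduce task: groups the (key, value) pairs
    // routed to it by key, in sorted key order, and calls reduce() once per
    // distinct key. Returns how many times reduce() was called.
    static int runReducer(List<Map.Entry<String, Integer>> pairs, Map<String, Integer> output) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        int calls = 0;
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            reduce(e.getKey(), e.getValue(), output);
            calls++;
        }
        return calls;
    }

    // Sums all values seen for a single key, like a word-count reducer would.
    static void reduce(String key, Iterable<Integer> values, Map<String, Integer> output) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        output.put(key, sum);
    }

    public static void main(String[] args) {
        Map<String, Integer> out1 = new TreeMap<>();
        Map<String, Integer> out2 = new TreeMap<>();
        // Same distribution of keys as in the question.
        int calls1 = runReducer(List.of(Map.entry("a", 1), Map.entry("a", 1)), out1);
        int calls2 = runReducer(
                List.of(Map.entry("b", 1), Map.entry("c", 1), Map.entry("d", 1)), out2);
        System.out.println("Reducer1: " + calls1 + " call(s), output " + out1);
        System.out.println("Reducer2: " + calls2 + " call(s), output " + out2);
    }
}
```

With the input from the question, the first simulated reducer makes 1 call and the second makes 3, matching the counts asked about.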
And during each invocation, does the values parameter contain ALL records with the same key, even when there are thousands of millions of records?
Yes, but they are streamed (hence the Iterable) rather than loaded into memory all at once. So with millions of records, the values are read off the local disk as you iterate.
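The practical consequence is that a reduce method should consume values in a single pass and keep only a running aggregate, rather than copying everything into a list. Below is a minimal plain-Java sketch of that pattern. It does not use Hadoop: StreamingValuesDemo and the lazy ones(...) source are hypothetical stand-ins for the framework-provided Iterable.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

class StreamingValuesDemo {
    // Hypothetical lazy source: yields `count` ones on demand without ever
    // holding them all in memory, standing in for values streamed from disk.
    static Iterable<Integer> ones(long count) {
        return () -> new Iterator<Integer>() {
            private long produced = 0;

            @Override
            public boolean hasNext() {
                return produced < count;
            }

            @Override
            public Integer next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                produced++;
                return 1;
            }
        };
    }

    // One-pass aggregation: memory use stays constant no matter how many
    // values the Iterable produces.
    static long sum(Iterable<Integer> values) {
        long total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sum(ones(10_000_000L)));
    }
}
```

Because only the running total is kept, the same loop works for ten values or ten million. In a real reducer it is safest to treat the values Iterable as a one-pass stream as well.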
Upvotes: 2