Reputation: 145
I created a MapReduce job that counts how many times each key appears and then sorts the keys by that count.
Given an input like:
1A99
1A34
1A99
1A99
1A34
1A12
The end goal would be a file like
1A99 3
1A34 2
1A12 1
My map phase outputs <key, 1> pairs of type <Text, IntWritable>.
My reduce phase has 3 stages: setup, where I initialize an ArrayList to hold my <Text, IntWritable> pairs; reduce, where I sum up the IntWritables to get the count and then insert the result into the list; and cleanup, where I sort the ArrayList by count.
The values in the ArrayList are instances of a class I created, myObject, which holds the Text and the IntWritable as a tuple. An oddity I found was that when I did
new myObject(key, count)
at the end all of the keys in the list were the same key, and only the counts differed.
If however I did
new myObject(new Text(key), count)
essentially making a copy of the key, it worked.
I can't find any information on whether the key passed into the reducer from the mapper is passed by reference, but that seems to be the only plausible explanation for why this occurs.
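The behaviour described above is what you would see if the framework hands the reducer one mutable key object and refills it for each new key. Hadoop does reuse the key and value instances across reduce() calls. The sketch below reproduces the effect in plain Java without Hadoop. MutableKey, KeyCount and KeyReuseDemo are hypothetical stand-ins for Text, myObject and the reducer, not the real classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a mutable Hadoop key such as Text.
final class MutableKey {
    private String value = "";

    MutableKey() {}

    // Copy constructor, playing the role of new Text(key).
    MutableKey(MutableKey other) { this.value = other.value; }

    void set(String v) { this.value = v; }

    @Override
    public String toString() { return value; }
}

// Hypothetical stand-in for the myObject tuple from the question.
final class KeyCount {
    final MutableKey key;
    final int count;

    KeyCount(MutableKey key, int count) {
        this.key = key;
        this.count = count;
    }
}

final class KeyReuseDemo {
    // Simulates three reduce() calls that all receive the SAME key instance.
    static List<String> collect(boolean copyKey) {
        String[] keys = {"1A12", "1A34", "1A99"};
        int[] counts = {1, 2, 3};

        MutableKey reused = new MutableKey(); // one object, refilled per call
        List<KeyCount> stored = new ArrayList<>();
        for (int i = 0; i < keys.length; i++) {
            reused.set(keys[i]);
            // Without a copy, every list entry points at the same object.
            MutableKey k = copyKey ? new MutableKey(reused) : reused;
            stored.add(new KeyCount(k, counts[i]));
        }

        List<String> out = new ArrayList<>();
        for (KeyCount kc : stored) {
            out.add(kc.key + " " + kc.count);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collect(false)); // [1A99 1, 1A99 2, 1A99 3]
        System.out.println(collect(true));  // [1A12 1, 1A34 2, 1A99 3]
    }
}
```

Storing the reference keeps only the last key seen, while copying the key at insertion time preserves each one, which matches the difference between new myObject(key, count) and new myObject(new Text(key), count).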
Upvotes: 0
Views: 172
Reputation: 4754
Understanding the actual problem is somewhat difficult without seeing the actual code. However, it seems you do not need stages 1 and 3 (setup and cleanup) of the reduce phase. The reducer receives a key (Text) and a list of values (Iterable<IntWritable>). This is the result of the intermediate shuffle phase that happens after the map phase. In the reduce step, perform whatever operation you need on the Iterable<IntWritable> (in your case, adding the values up). Processing for that key is then done, so output the result from the reducer using context.write(key, result_of_operation).
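As a rough illustration of that reduce step, here is a minimal plain-Java sketch that runs without Hadoop. SumReducerSketch is a hypothetical name, String and Integer stand in for Text and IntWritable, and the BiConsumer stands in for context.write, so this only mirrors the shape of a real reducer.

```java
import java.util.List;
import java.util.function.BiConsumer;

final class SumReducerSketch {
    // Same shape as reduce(key, values, context): sum the values for one key
    // and emit a single (key, total) pair. The BiConsumer is an assumed
    // stand-in for context.write.
    static void reduce(String key, Iterable<Integer> values,
                       BiConsumer<String, Integer> write) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        write.accept(key, sum);
    }

    public static void main(String[] args) {
        // Prints "1A99 3"
        reduce("1A99", List.of(1, 1, 1),
               (k, total) -> System.out.println(k + " " + total));
    }
}
```

In a real job the framework calls reduce once per key, so no per-key state needs to be kept between calls for the summing itself.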
This is how the processing for your dataset will happen:
Raw data
1A99
1A34
1A99
1A99
1A34
1A12
Some number of mappers will process this, say 3:
input to mapper 1
1A99
1A34
1A99
input to mapper 2
1A99
input to mapper 3
1A34
1A12
output of mapper 1
1A99, 1
1A34, 1
1A99, 1
output of mapper 2
1A99, 1
output of mapper 3
1A34, 1
1A12, 1
intermediate shuffle phase collects all values for keys
1A99, (1,1,1)
1A34, (1, 1)
1A12, (1)
Now, say we force one reducer (though there may be more than one)
reducer 1 input and output
1A99, (1,1,1) -> 1A99, 3
1A34, (1, 1) -> 1A34, 2
1A12, (1) -> 1A12, 1
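The walkthrough above can be simulated end to end in plain Java as a sanity check. This is only a sketch: CountWalkthrough and its method names are hypothetical, a TreeMap stands in for the shuffle, and the final descending sort is added here because the question's target output is ordered by count. How to do that ordering inside a real job is outside the scope of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class CountWalkthrough {
    // Map + shuffle: emit (record, 1) and group the 1s by key.
    static Map<String, List<Integer>> mapAndShuffle(List<String> records) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String r : records) {
            grouped.computeIfAbsent(r, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    // Reduce: sum each group, then order the totals by count, largest first.
    static List<Map.Entry<String, Integer>> reduceAndSort(
            Map<String, List<Integer>> grouped) {
        List<Map.Entry<String, Integer>> totals = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            totals.add(Map.entry(e.getKey(), sum));
        }
        totals.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        return totals;
    }

    public static void main(String[] args) {
        List<String> raw = List.of("1A99", "1A34", "1A99", "1A99", "1A34", "1A12");
        for (Map.Entry<String, Integer> e : reduceAndSort(mapAndShuffle(raw))) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
        // Prints:
        // 1A99 3
        // 1A34 2
        // 1A12 1
    }
}
```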
Upvotes: 0