Marc-9

Reputation: 145

Are keys in MapReduce passed by value or by reference?

I created a MapReduce job that counts how many times each key appears and then sorts the keys by that count.

When dealing with an input like

1A99
1A34
1A99
1A99
1A34
1A12

The end goal would be a file like

1A99 3
1A34 2
1A12 1

My map phase outputs a <Key, 1> pair of types <Text, IntWritable>.

My reduce phase has 3 stages: setup, where I initialize an ArrayList to hold my <Text, IntWritable> pairs; the reduce step, where I sum the IntWritables to get the count and insert the result into the list; and cleanup, where I sort the ArrayList by count.

The values in the ArrayList were instances of a class I created, myObject, which holds the Text and the IntWritable as a tuple. An oddity I found was that when I did

new myObject(key, count)
    

at the end every entry in the list had the same key, and only the counts differed.

If however I did

new myObject(new Text(key), count)

which essentially makes a copy of the key, it worked.

I can't find any information on whether the key passed into the reducer is passed by reference, but that seems to be the only plausible explanation for why this occurs.

Upvotes: 0

Views: 172

Answers (1)

Jagrut Sharma

Reputation: 4754

Understanding the actual problem is somewhat difficult without seeing the actual code. However, it seems you do not need stages 1 and 3 of your reduce phase. The reducer receives a key (Text) and a list of values (Iterable<IntWritable>). This grouping is the result of the intermediate shuffle phase that happens after the map phase. In the reduce step, perform whatever operation you need on the Iterable<IntWritable> (in your case, adding the values up). At that point the processing for that key is done. Output the result from the reducer using context.write(key, result_of_operation).
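On the by-reference question itself: Java passes object references by value, and Hadoop's Reducer documentation states that the framework reuses the key and value objects it passes into reduce. Storing the key without copying it therefore stores the same object every time. The following is a minimal plain-Java sketch of that effect, not Hadoop code: MutableKey is a hypothetical stand-in for Text, and the loop in collect is an assumed simplification of how the framework refills one key object per reduce call.

```java
import java.util.ArrayList;
import java.util.List;

class KeyReuseDemo {
    // Hypothetical stand-in for a mutable Hadoop key type such as Text.
    static final class MutableKey {
        private String value = "";

        MutableKey() {}

        // Copy constructor, playing the role of new Text(key).
        MutableKey(MutableKey other) { this.value = other.value; }

        void set(String v) { this.value = v; }

        @Override
        public String toString() { return value; }
    }

    // Simulates a framework that refills ONE key object for every reduce call.
    // With copyKey == false the list stores the same reference three times.
    static List<String> collect(boolean copyKey) {
        String[] incoming = {"1A12", "1A34", "1A99"};
        MutableKey reused = new MutableKey();
        List<MutableKey> stored = new ArrayList<>();
        for (String k : incoming) {
            reused.set(k); // overwrite the contents of the shared object
            stored.add(copyKey ? new MutableKey(reused) : reused);
        }
        List<String> out = new ArrayList<>();
        for (MutableKey k : stored) {
            out.add(k.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collect(false)); // [1A99, 1A99, 1A99]
        System.out.println(collect(true));  // [1A12, 1A34, 1A99]
    }
}
```

Without the copy, every stored entry shows the last key seen, which matches the symptom in the question; with the copy, each entry keeps its own key. The same reasoning would apply to an IntWritable that is reused across calls.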

This is how the processing for your dataset will happen:

Raw data
1A99
1A34
1A99
1A99
1A34
1A12

Some number of mappers will process this, say 3:
input to mapper 1
1A99
1A34
1A99

input to mapper 2
1A99

input to mapper 3
1A34
1A12

output of mapper 1
1A99, 1
1A34, 1
1A99, 1

output of mapper 2
1A99, 1

output of mapper 3
1A34, 1
1A12, 1

intermediate shuffle phase collects all values for keys
1A99, (1,1,1)
1A34, (1, 1)
1A12, (1)

Now, say we force one reducer (though there may be more than one)

reducer 1 input and output
1A99, (1,1,1) -> 1A99, 3
1A34, (1, 1) -> 1A34, 2
1A12, (1) -> 1A12, 1
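
The walkthrough above can be sketched in plain Java, without Hadoop, to make the grouping and summing concrete. The class and method names (ShuffleSketch, shuffle, reduce) are hypothetical, and the map preserves first-seen order only to mirror the listing above; a real job presents keys to the reducer in sorted key order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ShuffleSketch {
    // Map + shuffle: emit (record, 1) for each record and group the 1s by key.
    static Map<String, List<Integer>> shuffle(List<String> records) {
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (String record : records) {
            grouped.computeIfAbsent(record, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    // Reduce: sum the grouped values for each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int v : entry.getValue()) {
                sum += v;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("1A99", "1A34", "1A99", "1A99", "1A34", "1A12");
        System.out.println(shuffle(raw));         // {1A99=[1, 1, 1], 1A34=[1, 1], 1A12=[1]}
        System.out.println(reduce(shuffle(raw))); // {1A99=3, 1A34=2, 1A12=1}
    }
}
```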

Reference that may help:

https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Example:_WordCount_v1.0

Upvotes: 0
