Reputation: 25383
One of the big benefits of Hadoop MapReduce is the fact that Map processes take place on the same machine that the data they operate upon resides (to the extent possible). But can this be or is this perhaps already true of the Reduce side? For example, in the extreme case of a Map-only job, all of the output data ends up on the same machine as the corresponding input data (right?). But in an intermediate case in which the output is somewhat correlated with the output, it seems reasonable to partition the output and to the extent possible keep it on same machine at it started on.
Is this possible? Does this already happen?
Upvotes: 0
Views: 154
Reputation: 5239
If you know that the data for a particular reducer is already on the right node after the map phase, and the algorithm allows for it (see this blog post about it) you should insert your reducer as a combiner. Combiners are like miniature reducers that only get to see co-located data. Often you can dramatically improve performance because the combiner output can be orders of magnitude smaller than the map output, so what's left to shuffle is trivial.
Of course, if indeed the map phase leaves your data already correctly partitioned, why use a reducer at all? Why not create a second map job that simulates a reducer?
Upvotes: 1
Reputation: 34184
Inputs to the Reducers can reside on any node(local or remote) and not necessarily on the same machine where they are running. As Mappers complete their output gets written onto the local FS of the machine where they are running. Once this is done the intermediate output is needed by the machines that are about to run the reduce task. One thing to note here is that all the values corresponding to a particular key go the same reducer. So, it's not always possible that the input to Reducers is local, since different sets of key/value pairs are processed by different Mappers running on different machines.
Now, before the Mapper output is sent to Reducers for further processing, the data is partitioned based on keys and each partition goes to a Reducer and all the key/value pairs in that partition get processed by that Reducer. During the process a lot of data shuffling takes place. So it's not possible to maintain the data locality in case of Reducers.
Hope this answers the question.
Upvotes: 2