rollotommasi

Reputation: 489

What's the difference between shuffle phase and combiner phase?

I'm pretty confused about the MapReduce framework; different sources describe it differently. Anyway, this is my idea of a MapReduce job (with a rough driver sketch after the list):

  1. Map() --> emits <key, value>
  2. Partitioner (OPTIONAL) --> divides the intermediate output from the mapper and assigns it to different reducers
  3. Shuffle phase, used to make: <key, list of values>
  4. Combiner, a component used like a mini-reducer which performs some operations on the data and then passes it to the reducer. The combiner runs locally, not on HDFS, saving space and time.
  5. Reducer, gets the data from the combiner, performs further operations (probably the same as the combiner's) and then releases the output.
  6. We will have n output parts, where n is the number of reducers.
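
To make it concrete, this is roughly how I picture those steps being wired up in a driver (a sketch using the standard org.apache.hadoop.mapreduce API; MyMapper, MyPartitioner, MyCombiner and MyReducer are placeholder class names):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MyJobDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "my job");
            job.setJarByClass(MyJobDriver.class);
            job.setMapperClass(MyMapper.class);            // step 1: emit <key, value>
            job.setPartitionerClass(MyPartitioner.class);  // step 2: assign keys to reducers
            // step 3 (shuffle) has no setter; the framework handles it
            job.setCombinerClass(MyCombiner.class);        // step 4: local mini-reduce
            job.setReducerClass(MyReducer.class);          // step 5: final aggregation
            job.setNumReduceTasks(4);                      // step 6: n = 4 output parts
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }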

Is this basically right? I mean, I found some sources stating that the combiner is the shuffle phase, and that it basically groups each record by key...

Upvotes: 2

Views: 4560

Answers (3)

Azim

Reputation: 1091

I don't think that the combiner is part of the Shuffle and Sort phase. The combiner is itself one of the (optional) phases of the job lifecycle.

The pipeline of these phases could look like: Map --> Partition --> Combiner (optional) --> Shuffle and Sort --> Reduce

Out of these phases, Map, Partition and Combiner operate on the same node. Hadoop dynamically selects nodes to run the Reduce phase depending upon the availability and accessibility of resources in the best possible way. Shuffle and Sort, an important middle-level phase, works across the Map and Reduce nodes.

When a client submits a job, the Map phase starts working on the input file, which is stored across nodes in the form of blocks. Mappers process the lines of the file one by one and put the results into a memory buffer of 100 MB (local to each mapper). When this buffer fills up to a certain threshold, 80% by default, it is sorted and then spilled to disk (as a file). Each mapper can generate multiple such intermediate sorted spill files. When the mapper is done with all the lines of its block, all such spill files are merged together (to form a single file) and sorted on the basis of key, and then the Combiner phase starts working on this single file. Note that if there is no Partition phase, only one intermediate file will be produced, but with partitioning multiple files get generated, depending upon the developer's logic. The image below, from O'Reilly's Hadoop: The Definitive Guide, may help you understand this concept in more detail.

[Image from Hadoop: The Definitive Guide illustrating the map-side merge and shuffle]
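
If you want to experiment with the buffer behaviour described above, these are the relevant knobs (a sketch; the property names are the Hadoop 2.x ones and the values shown are the defaults just described):

    import org.apache.hadoop.conf.Configuration;

    // in the driver, before creating the Job
    Configuration conf = new Configuration();
    // size of the in-memory map-output sort buffer, in MB (default 100)
    conf.setInt("mapreduce.task.io.sort.mb", 100);
    // fraction of the buffer at which spilling to disk starts (default 0.80)
    conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);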

Later, Hadoop copies the merged file from each of the mapper nodes to the reducer nodes depending upon the key. That is, all the records with the same key will be copied to the same reducer node.

I think you may already know in depth how the Shuffle and Sort and Reduce phases work, so I won't go into more detail on those topics.

Also, for more information, I would suggest you read O'Reilly's Hadoop: The Definitive Guide. It's an awesome book for Hadoop.

Upvotes: 1

vefthym

Reputation: 7462

The combiner is NOT at all similar to the shuffle phase. What you describe as shuffling is wrong, and that is the root of your confusion.

Shuffling is just copying the map output from map to reduce; it has nothing to do with key generation. It is the first phase of a Reducer, with the other two being sorting and then reducing.

Combining is like executing a reducer locally, for the output of each mapper. It basically acts like a reducer (it also extends the Reducer class), which means that, like a reducer, it groups the local values that the mapper has emitted for the same key.
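
For example, a minimal combiner for a word count could look like the following sketch (SumCombiner is a name I made up; note that it extends Reducer and, like a reducer, sums the values the local mapper emitted for each key):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();  // aggregate the values this mapper emitted for the key
            }
            result.set(sum);
            context.write(key, result);
        }
    }

You would register it with job.setCombinerClass(SumCombiner.class) in the driver.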

Partitioning is, indeed, assigning the map output keys to specific reduce tasks, but it is not optional. Overriding the default HashPartitioner with an implementation of your own is optional.
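
For reference, the default HashPartitioner essentially does the following, and overriding it just means providing your own Partitioner subclass (a sketch; MyPartitioner is a placeholder name):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class MyPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            // same logic as the default HashPartitioner: mask off the sign bit,
            // then map the hash onto one of the numPartitions reduce tasks
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

It is enabled with job.setPartitionerClass(MyPartitioner.class).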

I tried to keep this answer minimal, but you can find more information in the book Hadoop: The Definitive Guide by Tom White, as Azim suggests, and some related things in this post.

Upvotes: 5

Vijay Bhoomireddy

Reputation: 576

Think of the combiner as a mini-reducer phase that works only on the output of the map task within each node, before that output is emitted to the actual reducer.

Taking the classical WordCount example, the map phase output would be (word, 1) for each word the map task processes. Let's assume the input to be processed is

"She lived in a big house with a big garage on the outskirts of a big city in India"

Without a combiner, the map phase would emit (big,1) three times, (a,1) three times and (in,1) twice. But when a combiner is used, the map phase would emit (big,3), (a,3) and (in,2). The individual occurrences of each of these words are aggregated locally within the map phase before the output is sent to the reduce phase. Where a combiner is used, this local aggregation minimises the network traffic from map to reduce.
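
In code, enabling this is essentially one extra line in the driver (a sketch; TokenizerMapper and IntSumReducer are the classes from Hadoop's stock WordCount example, where the reducer doubles as the combiner because summing is associative and commutative):

    // inside the WordCount driver, after creating the Job
    job.setMapperClass(TokenizerMapper.class);  // emits (word, 1)
    job.setCombinerClass(IntSumReducer.class);  // local: (big,1) x3 becomes (big,3)
    job.setReducerClass(IntSumReducer.class);   // global sum across all mappers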

During the shuffle phase, the output from the various map tasks is redirected to the correct reducer. This is handled internally by the framework. If a custom partitioner is used, the input to the reducers is shuffled accordingly.

Upvotes: 3
