Chander Shivdasani

Reputation: 10121

Sorting large data using MapReduce/Hadoop

I am reading about MapReduce and the following thing is confusing me.

Suppose we have a file with 1 million entries (integers) and we want to sort them using MapReduce. The way I understood to go about it is as follows:

Write a mapper function that sorts integers. The framework will divide the input file into multiple chunks and give them to different mappers. Each mapper will sort its chunk of data independently of the others. Once all the mappers are done, we will pass each of their results to the Reducer, which will combine the results and give me the final output.

My doubt is: if we have only one reducer, how does this leverage the distributed framework, if we eventually have to combine the result in one place? The problem boils down to merging 1 million entries in one place. Is that so, or am I missing something?

Thanks, Chander

Upvotes: 35

Views: 29200

Answers (6)

Alok Nayak

Reputation: 2541

Sorry for being late, but for future readers: yes, Chander, you are missing something.

The point is that a reducer can only handle the shuffled and sorted data of the node it is running on. A reducer running on one node cannot look at another node's data; it applies the reduce algorithm to its own data only. So the merging procedure of merge sort can't be applied across reducers.

So for big data we use TeraSort, which is nothing but an identity mapper and reducer with a custom partitioner. You can read more about it in Hadoop's implementation of TeraSort, which states:

"TeraSort is a standard map/reduce sort, except for a custom partitioner that uses a sorted list of N − 1 sampled keys that define the key range for each reduce. In particular, all keys such that sample[i − 1] <= key < sample[i] are sent to reduce i. This guarantees that the output of reduce i are all less than the output of reduce i+1."

Upvotes: 4

pr-pal

Reputation: 3538

Sorting can be efficiently implemented using MapReduce. But you seem to be thinking about implementing merge sort with MapReduce to achieve this, and that may not be the ideal candidate.

As you alluded to, merge sort (with map-reduce) would involve the following steps (a rough sketch follows the list):

  1. Partition the elements into small groups and assign each group to the mappers in a round-robin manner
  2. Each mapper sorts its subset and returns {K, {subset}}, where K is the same for all mappers
  3. Since the same K is used across all mappers, there is only one reduce group and hence only one reducer. That reducer can merge the data and return the sorted result
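A rough sketch of steps 2 and 3 in Hadoop terms, assuming the integers arrive one per line and each mapper's sorted chunk is serialized as a comma-joined Text value (the class names and serialization format are a simplification for illustration, not from the original post):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.PriorityQueue;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Step 2: each mapper sorts its own split and emits it under one shared key.
    class ChunkSortMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
        private static final IntWritable SHARED_KEY = new IntWritable(0);
        private final List<Integer> buffer = new ArrayList<>();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx) {
            buffer.add(Integer.parseInt(line.toString().trim()));
        }

        @Override
        protected void cleanup(Context ctx) throws IOException, InterruptedException {
            if (buffer.isEmpty()) return;
            Collections.sort(buffer);
            StringBuilder sb = new StringBuilder();
            for (int v : buffer) sb.append(v).append(',');
            ctx.write(SHARED_KEY, new Text(sb.toString()));
        }
    }

    // Step 3: the single reducer k-way merges the pre-sorted chunks with a heap.
    class MergeReducer extends Reducer<IntWritable, Text, IntWritable, NullWritable> {
        @Override
        protected void reduce(IntWritable key, Iterable<Text> sortedChunks, Context ctx)
                throws IOException, InterruptedException {
            // Materialize every pre-sorted chunk, then do a classic k-way merge.
            List<int[]> chunks = new ArrayList<>();
            for (Text t : sortedChunks) {
                String[] parts = t.toString().split(",");
                int[] chunk = new int[parts.length];
                for (int i = 0; i < parts.length; i++) chunk[i] = Integer.parseInt(parts[i]);
                chunks.add(chunk);
            }
            int[] pos = new int[chunks.size()];
            // Heap entries are {value, chunkIndex}, ordered by value.
            PriorityQueue<int[]> heap =
                    new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
            for (int c = 0; c < chunks.size(); c++) {
                heap.add(new int[]{chunks.get(c)[0], c});
            }
            while (!heap.isEmpty()) {
                int[] head = heap.poll();
                ctx.write(new IntWritable(head[0]), NullWritable.get());
                int c = head[1];
                if (++pos[c] < chunks.get(c).length) {
                    heap.add(new int[]{chunks.get(c)[pos[c]], c});
                }
            }
        }
    }

Because everything funnels through the one shared key, the k-way merge itself runs on a single node, which is exactly the bottleneck discussed in the question.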

The problem here is that, as you mentioned, there can be only one reducer, which precludes parallelism during the reduce phase. As mentioned in other replies, MapReduce-specific implementations like TeraSort can be considered for this purpose.

Found the explanation at http://www.chinacloud.cn/upload/2014-01/14010410467139.pdf

Coming back to merge sort, this would be feasible if the Hadoop (or equivalent) tool provided a hierarchy of reducers, where the output of one level of reducers goes to the next level of reducers, or is looped back to the same set of reducers.

Upvotes: 1

rOrlig

Reputation: 2529

So the simplest way to sort using map-reduce (though not the most efficient one) is to do the following:

During the map phase, for each (Input_Key, Input_Value), emit (Input_Value, Input_Key).

The reducer is an identity reducer.

So, for example, if our data is a (student, age) database, then your mapper input would be ('A', 1) ('B', 2) ('C', 10) ... and the output would be (1, 'A') (2, 'B') (10, 'C'). The framework's shuffle sorts by key, so the records reach the reducer already in key order.
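A rough, untested Hadoop-style sketch of that swap mapper plus identity reducer (the class names and the assumption that records already arrive as (name, age) pairs are illustrative):

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: swap the pair so the value to be sorted becomes the key;
    // the framework's shuffle then sorts records by key before the reduce phase.
    class SwapMapper extends Mapper<Text, IntWritable, IntWritable, Text> {
        @Override
        protected void map(Text name, IntWritable age, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(age, name); // ('A', 1) becomes (1, 'A')
        }
    }

    // Reduce phase: identity reducer that re-emits what the shuffle handed it,
    // which is already sorted by key (i.e. by age).
    class SwapIdentityReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable age, Iterable<Text> names, Context ctx)
                throws IOException, InterruptedException {
            for (Text name : names) {
                ctx.write(age, name);
            }
        }
    }

Note that with more than one reducer this only gives per-reducer sorted output; a range partitioner (as described in the other answers) is needed for a single globally sorted result.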

Haven't tried this logic out, but it is a step in a homework problem I am working on. Will post an update with a source code/logic link.

Upvotes: 7

SquareCog

Reputation: 19666

As others have mentioned, merging is much simpler than sorting, so there's a big win there.

However, doing an O(N) serial operation on a giant dataset can be prohibitive, too. As you correctly point out, it's better to find a way to do the merge in parallel, as well.

One way to do this is to replace the partitioning function, swapping the default partitioner (which spreads keys across reducers with no regard to order) for something a bit smarter. What Pig does for this, for example, is sample your dataset to come up with a rough approximation of the distribution of your values, and then assign ranges of values to different reducers. Reducer 0 gets all elements < 1000, reducer 1 gets all elements >= 1000 and < 5000, and so on. Then you can do the merge in parallel, and the end result is sorted because you know each reducer's number: concatenating the outputs in reducer order gives the globally sorted result.
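Hadoop itself ships roughly this mechanism as TotalOrderPartitioner plus InputSampler. A sketch of the driver wiring, assuming the input is a SequenceFile keyed by the values to sort (paths, reducer count, and sampler parameters are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
    import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

    public class TotalOrderSortDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "total-order sort");
            job.setJarByClass(TotalOrderSortDriver.class);

            // Identity map and reduce: the shuffle's sort-by-key does the real work.
            job.setInputFormatClass(SequenceFileInputFormat.class);
            job.setOutputKeyClass(IntWritable.class);
            job.setOutputValueClass(NullWritable.class);
            job.setNumReduceTasks(4);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Range-partition the keys using split points derived from a sample of
            // the input, so reducer i's output is entirely below reducer i+1's.
            job.setPartitionerClass(TotalOrderPartitioner.class);
            TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
                    new Path(args[1] + "_partitions"));
            InputSampler.writePartitionFile(job,
                    new InputSampler.RandomSampler<IntWritable, NullWritable>(0.01, 1000, 10));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }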

Upvotes: 15

Peter Tillemans

Reputation: 35341

Check out merge-sort.

It turns out that merging already-sorted lists is much more efficient, in terms of operations and memory consumption, than sorting the complete list from scratch.

If the reducer gets 4 sorted lists, it only needs to repeatedly look at the smallest remaining element of each of the 4 lists and pick that one. If the number of lists is constant, this reducing is an O(N) operation.
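A small, standalone (non-Hadoop) Java illustration of that merge step, picking the smallest head among k sorted lists via a heap:

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import java.util.PriorityQueue;

    public final class KWayMerge {

        // One heap entry per input list: the current head value plus the iterator
        // that can supply that list's next element.
        private record Head(int value, Iterator<Integer> rest) {}

        // Merge k already-sorted lists by repeatedly taking the smallest head.
        // For a constant k this is linear in the total number of elements.
        public static List<Integer> merge(List<List<Integer>> sortedLists) {
            PriorityQueue<Head> heap =
                    new PriorityQueue<>((a, b) -> Integer.compare(a.value(), b.value()));
            for (List<Integer> list : sortedLists) {
                Iterator<Integer> it = list.iterator();
                if (it.hasNext()) {
                    heap.add(new Head(it.next(), it));
                }
            }
            List<Integer> merged = new ArrayList<>();
            while (!heap.isEmpty()) {
                Head smallest = heap.poll();
                merged.add(smallest.value());
                if (smallest.rest().hasNext()) {
                    heap.add(new Head(smallest.rest().next(), smallest.rest()));
                }
            }
            return merged;
        }

        public static void main(String[] args) {
            System.out.println(merge(List.of(
                    List.of(1, 4, 9), List.of(2, 3, 8), List.of(5, 6, 7))));
            // -> [1, 2, 3, 4, 5, 6, 7, 8, 9]
        }
    }

With the number of lists fixed, each element is touched only a constant number of times, which is the O(N) behaviour described above.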

Also, typically the reducers are themselves "distributed" in something like a tree, so the work can be parallelized too.

Upvotes: 25

Gopi

Reputation: 10293

I think combining multiple sorted lists is more efficient than combining multiple unsorted lists. So the mappers do the task of sorting chunks and the reducer merges them. Had the mappers not done the sorting, the reducer would have a tough time doing it.

Upvotes: 1
