How to sort an arbitrarily large set of data using Hadoop?

Question

My question is related to this post Sorting large data using MapReduce/Hadoop. My idea of sorting an arbitrarily set is:

We have a large file with records, say 10^9 records.
The file is split across M mappers. Each mapper sorts a split of size, say 10000 records using QuickSort and outputs that sorted subsequence. The output key ranges between 1 and R, where R is the number of reducer tasks (suppose R = 4). The value is the sorted subsequence.
Each Reducer reads K subsequences and merges them (taking the smallest element from subsequences iteratively until subsequences are empty). The output is written to a file.

Then the following processing is done:

To take advantage of the locality of data, new Reducer tasks could be scheduled to merge several output files produced by the previous reducer task. So for example if K=5, first reducer task would produce files of size 50000 and the new reducer task would work with 5 files of 50000 sorted records each. New Reducer jobs would be scheduled until only one file remains, in this case of size 250.000.000 (because R=4). Finally a new Reducer job would be scheduled on another machine to merge the files into a single 10^9 file

My Question: is it possible in Hadoop to schedule the execution of Reducer jobs in such a way that they merge the files in some directory until only 1 file remains? If yes, how?

Another scenario would be to schedule the MapReduce jobs after each merge step, so for example the files of size 50000 would be merged in parallel by reduce tasks running on other machines, then files of size 250.000 on yet other machines, etc. but this would generate a lot of network traffic. In any case the question remains valid for this case also - how to chain several MapReduce jobs such that the chaining stops after only 1 resulting file is output?

Arnon Rotem-Gal-Oz · Accepted Answer

Hadoop sorting is done with a partitioner. see for example the source code for the terasort benchmark

How to sort an arbitrarily large set of data using Hadoop?

Answers (1)

Related Questions