Reputation: 2370
My question is related to this post: Sorting large data using MapReduce/Hadoop. My idea for sorting an arbitrarily large data set is:
Then the following processing is done:
To take advantage of data locality, new Reducer tasks could be scheduled to merge several output files produced by the previous Reducer tasks. For example, if K=5, the first Reducer tasks would produce files of 50,000 sorted records each, and each new Reducer task would then merge 5 such files of 50,000 sorted records. New Reducer jobs would be scheduled until only one file per partition remains, in this case of 250,000,000 records (because R=4). Finally, a new Reducer job would be scheduled on another machine to merge those files into a single file of 10^9 records.
My question: is it possible in Hadoop to schedule Reducer jobs so that they keep merging the files in some directory until only one file remains? If yes, how?
Another scenario would be to schedule a new MapReduce job after each merge step, so that, for example, the files of 50,000 records would be merged in parallel by reduce tasks running on other machines, then the files of 250,000 records on yet other machines, and so on, but this would generate a lot of network traffic. In either case the question remains the same: how can several MapReduce jobs be chained so that the chaining stops once only one output file remains?
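One common way to express this kind of chaining is a driver loop on the client that launches one identity MapReduce job per merge round and inspects HDFS between rounds. The following is a minimal sketch, not code from the linked post: the class name IterativeMergeDriver, the round-N output directories, and the merge factor K=5 are illustrative assumptions, and it relies on the identity Mapper/Reducer plus the shuffle to keep each partition sorted.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeMergeDriver {
    private static final int K = 5; // merge factor, as in the question

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path input = new Path(args[0]);
        int round = 0;

        // How many sorted run files the current directory holds.
        int files = fs.listStatus(input,
                p -> p.getName().startsWith("part-")).length;

        // Launch one identity job per merge round, dividing the number
        // of reducers by K each time, until one output file remains.
        while (files > 1) {
            int reducers = Math.max(1, files / K);
            Path output = new Path(args[1] + "/round-" + round);

            Job job = Job.getInstance(conf, "merge-round-" + round);
            job.setJarByClass(IterativeMergeDriver.class);
            job.setInputFormatClass(KeyValueTextInputFormat.class);
            job.setMapperClass(Mapper.class);   // identity mapper
            job.setReducerClass(Reducer.class); // identity reducer
            job.setNumReduceTasks(reducers);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, input);
            FileOutputFormat.setOutputPath(job, output);

            if (!job.waitForCompletion(true)) {
                System.exit(1); // abort the chain if a round fails
            }
            input = output;
            files = fs.listStatus(output,
                    p -> p.getName().startsWith("part-")).length;
            round++;
        }
    }
}
```

Note that each round's reducers re-sort whatever partition they receive during the shuffle, so the "merging" here is really repartition-plus-sort; if the concatenation of the final part files must itself be globally ordered, a range (total-order) partitioner is needed as well.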
Upvotes: 0
Views: 3149
Reputation: 25909
Hadoop sorting is done with a partitioner. See, for example, the source code of the TeraSort benchmark.
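The actual TeraSort source uses its own input sampler and a trie-based partitioner, but the same idea is available off the shelf through Hadoop's TotalOrderPartitioner and InputSampler. Here is a minimal sketch, assuming Text keys and values read via KeyValueTextInputFormat; the class name TotalOrderSort, the partition-file path, and the sampling parameters (0.01 frequency, 10,000 samples) are illustrative choices, not values from TeraSort itself.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class TotalOrderSort {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "total-order-sort");
        job.setJarByClass(TotalOrderSort.class);

        // Identity map and reduce: the shuffle does the actual sorting.
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(4); // R = 4 sorted, non-overlapping ranges

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Sample the input to pick R-1 split keys, then range-partition
        // on them so reducer i only sees keys below reducer i+1's keys.
        Path partitionFile = new Path(args[1] + "_partitions");
        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
                partitionFile);
        InputSampler.writePartitionFile(job,
                new InputSampler.RandomSampler<Text, Text>(0.01, 10000));
        job.setPartitionerClass(TotalOrderPartitioner.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Because the split points are sampled so that each reducer receives only keys smaller than those of the next, the R output files are sorted and non-overlapping; concatenating part-r-00000 through part-r-00003 yields the fully sorted data, with no merge chain needed.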
Upvotes: 1