Reputation: 8664
I am a little confused by this note, given on page 112 of the book MapReduce Design Patterns:
Note that the number of ranges in the intermediate partition needs to be equal to the number of reducers in the order step. If you decide to change the number of reducers and you’ve been reusing the same file, you’ll need to rebuild it
For starters, I am not entirely sure what the term "intermediate partition" means in this context. Can someone explain it with an example, please?
Also, the book does not go on to explain the reason this is required. I am guessing that the reason is:
That passing each intermediate partition to a reducer would enable the processing of all the partitions in parallel and thus be most efficient...
But for argument's sake, if I am okay with the inefficiency, could I put in any arbitrary number as the number of reducers? Will it affect the final output in any way (other than performance)?
you’ve been reusing the same file, you’ll need to rebuild it
What do the above two lines mean?
Upvotes: 0
Views: 482
Reputation: 1399
Here is how TotalOrderPartitioner works: first, a Sampler (e.g. RandomSampler) runs (on the client) and creates samples of the data. If the data were sorted, those samples would hopefully divide it into approximately equal chunks. Second, the sorting MapReduce job uses TotalOrderPartitioner with these samples to distribute data between reducers. Every reducer gets its chunk of the data sorted and just writes its output. Since the data was distributed to the reducers according to the samples, when we concatenate the outputs of the reducers, we get the whole input data sorted.
It can be viewed as Quick Sort with one level of recursion.
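The whole flow can be simulated in plain Java (no Hadoop) to see why concatenating the reducer outputs yields a total order. The split keys, input words, and class name below are made up for illustration; the binary search mirrors the idea behind TotalOrderPartitioner's key lookup, not its exact implementation:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class TotalOrderSketch {

    // Route a key to one of splitKeys.length + 1 reducers by binary search
    // over the sorted split keys (R reducers need R-1 keys).
    static int partition(String key, String[] splitKeys) {
        int idx = Arrays.binarySearch(splitKeys, key);
        return idx >= 0 ? idx + 1 : -(idx + 1);
    }

    // Simulates the job: partition -> per-reducer sort -> concatenate.
    static List<String> totalOrderSort(List<String> input, String[] splitKeys) {
        int reducers = splitKeys.length + 1;
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < reducers; i++) buckets.add(new ArrayList<>());
        for (String record : input) {
            buckets.get(partition(record, splitKeys)).add(record);
        }
        for (List<String> bucket : buckets) {
            Collections.sort(bucket); // each reducer sorts only its own chunk
        }
        List<String> out = new ArrayList<>();
        for (List<String> bucket : buckets) {
            out.addAll(bucket);       // concatenation is globally sorted
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("pear", "apple", "kiwi", "fig",
                                           "plum", "date", "grape", "mango");
        // "Samples": two split keys -> three reducers. A real RandomSampler
        // would only approximate good split points.
        String[] splits = {"fig", "mango"};
        System.out.println(totalOrderSort(input, splits));
        // -> [apple, date, fig, grape, kiwi, mango, pear, plum]
    }
}
```

This is exactly the quicksort analogy: the split keys play the role of pivots, and each "recursive" sort happens on a separate reducer.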
As to the statement you asked about:
Note that the number of ranges in the intermediate partition needs to be equal to the number of reducers in the order step. If you decide to change the number of reducers and you’ve been reusing the same file, you’ll need to rebuild it
"Intermediate partitions" are the partitions represented by the file with the samples (the partition file).
The number of those intermediate partitions should be equal to the number of reducers. This is a requirement. From the javadoc:
public static void setPartitionFile(Configuration conf, Path p)
// Set the path to the SequenceFile storing the sorted partition keyset. It must be the case that for R reduces, there are R-1 keys in the SequenceFile.
InputSampler creates a number of intermediate partitions equal to the number of reducers in your MapReduce job (i.e. R-1 split keys for R reducers), satisfying this requirement.
If you re-run the sorting, and your data has changed only slightly so that the samples should still represent it well, you can reuse the existing partition file, as its creation on the client is expensive. But you have to use the same number of reducers as in the job for which InputSampler created the partition file.
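For reference, the driver-side wiring looks roughly like this (classes from `org.apache.hadoop.mapreduce.lib.partition`; the path, sampler parameters, and reducer count are made-up illustrations, not values from the book):

```java
// Sketch of a total-order sort driver; not a complete, runnable job.
Job job = Job.getInstance(conf, "total order sort");
job.setNumReduceTasks(3);                                  // R = 3
job.setPartitionerClass(TotalOrderPartitioner.class);
TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
        new Path("/tmp/_partitions"));                     // hypothetical path

// The expensive client-side step: sample the input and write R-1 = 2
// split keys into the partition file. The file can be reused across
// runs only while the job keeps R = 3 reducers.
InputSampler.Sampler<Text, Text> sampler =
        new InputSampler.RandomSampler<>(0.1, 10000, 10);
InputSampler.writePartitionFile(job, sampler);
```

If you later call `job.setNumReduceTasks(4)` but keep the old 2-key partition file, the R-1 requirement from the javadoc is violated, which is why the book says you must rebuild it.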
Upvotes: 4
Reputation: 58
I think "intermediate partition" refers to the intermediate partition file that you build during the analyze phase. This file determines the key range assigned to each reducer. As you want the ranges to be of roughly equal size, you first analyze a sample of your data and then construct the ranges accordingly. You also specify the number of ranges.
As each reducer in the order phase receives one range, the number of reducers needs to be equal to the number of ranges in that partition file. If you have 5 ranges in your file but use only 4 reducers, Hadoop won't know where to send the fifth range. In that case, you would need to rebuild your partition file so that it contains only 4 ranges.
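The mismatch can be made concrete with a toy partition function (plain Java, no Hadoop; split keys and class name are invented for the example). A partition file built for 5 reducers holds 4 split keys, so some keys map to partition index 4, which does not exist in a 4-reducer job:

```java
import java.util.Arrays;

public class PartitionMismatch {

    // 4 split keys -> 5 ranges, i.e. a partition file built for 5 reducers.
    static final String[] SPLITS_FOR_FIVE = {"d", "h", "m", "r"};

    static int partition(String key) {
        int idx = Arrays.binarySearch(SPLITS_FOR_FIVE, key);
        return idx >= 0 ? idx + 1 : -(idx + 1);
    }

    public static void main(String[] args) {
        int reducers = 4;             // the job was re-run with only 4 reducers
        int p = partition("z");       // "z" lands in the fifth range: index 4
        System.out.println("partition " + p
                + ", but valid partitions are 0.." + (reducers - 1));
        // A real job would fail the map task on such an out-of-range
        // partition, hence "rebuild the partition file" with
        // reducers - 1 = 3 split keys instead.
    }
}
```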
Upvotes: 0