Reputation: 8664
I am a little confused by this note, given on page 112 of the book MapReduce Design Patterns:
Note that the number of ranges in the intermediate partition needs to be equal to the number of reducers in the order step. If you decide to change the number of reducers and you’ve been reusing the same file, you’ll need to rebuild it
For starters, I am not entirely sure what the term "intermediate partition" means in this context. Can someone explain it with an example, please?
Also, the book does not go on to explain the reason this is required. I am guessing that the reason is:
That passing each intermediate partition to a reducer would enable the processing of all the partitions in parallel and thus be most efficient...
But for argument's sake, if I am okay with the inefficiency, could I put in any arbitrary number as the number of reducers? Will it affect the final output in any way (other than performance)?
you’ve been reusing the same file, you’ll need to rebuild it
What do the above two lines mean?
Upvotes: 0
Views: 482
Reputation: 1399
Here is how TotalOrderPartitioner works: first, a Sampler (e.g. RandomSampler) runs (on the client) and creates samples of the data. If the data were sorted, those samples would hopefully divide it into approximately equal chunks. Second, the sorting MapReduce job uses TotalOrderPartitioner with these samples to distribute data between reducers. Every reducer gets its chunk of the data sorted and just writes its output. Since the data was distributed to the reducers according to the samples, when we concatenate the outputs of the reducers, we get the whole input data sorted.
It can be viewed as Quick Sort with one level of recursion.
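The whole flow can be simulated in plain Java (no Hadoop) to see why concatenating the reducer outputs yields a total order. The split keys, input words, and class name below are made up for illustration; the binary search mirrors the idea behind TotalOrderPartitioner's key lookup, not its exact implementation:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class TotalOrderSketch {

    // Route a key to one of splitKeys.length + 1 reducers by binary search
    // over the sorted split keys (R reducers need R-1 keys).
    static int partition(String key, String[] splitKeys) {
        int idx = Arrays.binarySearch(splitKeys, key);
        return idx >= 0 ? idx + 1 : -(idx + 1);
    }

    // Simulates the job: partition -> per-reducer sort -> concatenate.
    static List<String> totalOrderSort(List<String> input, String[] splitKeys) {
        int reducers = splitKeys.length + 1;
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < reducers; i++) buckets.add(new ArrayList<>());
        for (String record : input) {
            buckets.get(partition(record, splitKeys)).add(record);
        }
        for (List<String> bucket : buckets) {
            Collections.sort(bucket); // each reducer sorts only its own chunk
        }
        List<String> out = new ArrayList<>();
        for (List<String> bucket : buckets) {
            out.addAll(bucket);       // concatenation is globally sorted
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("pear", "apple", "kiwi", "fig",
                                           "plum", "date", "grape", "mango");
        // "Samples": two split keys -> three reducers. A real RandomSampler
        // would only approximate good split points.
        String[] splits = {"fig", "mango"};
        System.out.println(totalOrderSort(input, splits));
        // -> [apple, date, fig, grape, kiwi, mango, pear, plum]
    }
}
```

This is exactly the quicksort analogy: the split keys play the role of pivots, and each "recursive" sort happens on a separate reducer.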
As to the statement you asked about:
Note that the number of ranges in the intermediate partition needs to be equal to the number of reducers in the order step. If you decide to change the number of reducers and you’ve been reusing the same file, you’ll need to rebuild it
"Intermediate partitions" are the partitions represented by the file with the samples (the partition file).
The number of those intermediate partitions should be equal to the number of reducers. This is a requirement. From the javadoc:
public static void setPartitionFile(Configuration conf, Path p)
// Set the path to the SequenceFile storing the sorted partition keyset. It must be the case that for R reduces, there are R-1 keys in the SequenceFile.
InputSampler creates a number of intermediate partitions equal to the number of reducers in your MapReduce job (i.e. R-1 split keys for R reducers), satisfying this requirement.
If you re-run the sorting, and your data has changed only slightly so that the samples should still represent it well, you can reuse the existing partition file, as its creation on the client is expensive. But you have to use the same number of reducers as in the job for which InputSampler created the partition file.
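For reference, the driver-side wiring looks roughly like this (classes from `org.apache.hadoop.mapreduce.lib.partition`; the path, sampler parameters, and reducer count are made-up illustrations, not values from the book):

```java
// Sketch of a total-order sort driver; not a complete, runnable job.
Job job = Job.getInstance(conf, "total order sort");
job.setNumReduceTasks(3);                                  // R = 3
job.setPartitionerClass(TotalOrderPartitioner.class);
TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
        new Path("/tmp/_partitions"));                     // hypothetical path

// The expensive client-side step: sample the input and write R-1 = 2
// split keys into the partition file. The file can be reused across
// runs only while the job keeps R = 3 reducers.
InputSampler.Sampler<Text, Text> sampler =
        new InputSampler.RandomSampler<>(0.1, 10000, 10);
InputSampler.writePartitionFile(job, sampler);
```

If you later call `job.setNumReduceTasks(4)` but keep the old 2-key partition file, the R-1 requirement from the javadoc is violated, which is why the book says you must rebuild it.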
Upvotes: 4
Reputation: 58
I think "intermediate partition" refers to the intermediate partition file that you build during the analyze phase. This file determines the key range assigned to each reducer. As you want the ranges to be of roughly equal size, you first analyze a sample of your data and then construct the ranges accordingly. You also specify the number of ranges.
As each reducer in the order phase receives one range, the number of reducers needs to be equal to the number of ranges in that partition file. If you have 5 ranges in your file but use only 4 reducers, Hadoop won't know where to send the fifth range. In that case, you would need to rebuild your partition file so that it contains only 4 ranges.
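The mismatch can be made concrete with a toy partition function (plain Java, no Hadoop; split keys and class name are invented for the example). A partition file built for 5 reducers holds 4 split keys, so some keys map to partition index 4, which does not exist in a 4-reducer job:

```java
import java.util.Arrays;

public class PartitionMismatch {

    // 4 split keys -> 5 ranges, i.e. a partition file built for 5 reducers.
    static final String[] SPLITS_FOR_FIVE = {"d", "h", "m", "r"};

    static int partition(String key) {
        int idx = Arrays.binarySearch(SPLITS_FOR_FIVE, key);
        return idx >= 0 ? idx + 1 : -(idx + 1);
    }

    public static void main(String[] args) {
        int reducers = 4;             // the job was re-run with only 4 reducers
        int p = partition("z");       // "z" lands in the fifth range: index 4
        System.out.println("partition " + p
                + ", but valid partitions are 0.." + (reducers - 1));
        // A real job would fail the map task on such an out-of-range
        // partition, hence "rebuild the partition file" with
        // reducers - 1 = 3 split keys instead.
    }
}
```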
Upvotes: 0