Cleanly separate Hadoop phases

Question

I'm interested in benchmarking a Hadoop cluster at specific phases of the MapReduce execution. That is, I would like a clean separation between the map phase, the shuffle phase, and the reduce phase.

Is there a way to refrain from shuffling or reducing anything before all map tasks have finished, and to refrain from reducing until all shuffles are finished? I don't care about the impact on execution time, because I'm only interested in resource consumption at each of these phases.

I saw another SO post about separating tasks on specific nodes by setting mapred.tasktracker.reduce.tasks.maximum to 0 on nodes that shouldn't reduce and mapred.tasktracker.map.tasks.maximum to 0 on nodes that shouldn't map. In that case, though, the map and reduce tasks still run concurrently, and I can't use my full cluster for each phase.

Thanks!

Praveen Sripati · Accepted Answer

Is there a way to refrain from shuffling or reducing anything before all map tasks have finished, and refraining from reducing until all shuffles are finished?

mapreduce.job.reduce.slowstart.completedmaps defaults to 0.05 and is documented as the "fraction of the number of maps in the job which should be complete before reduces are scheduled for the job."

Set this parameter to 1 and shuffling won't start until all map tasks have completed, because the shuffle is carried out by the reduce tasks, which are not scheduled before then.
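As a rough sketch of what that setting could look like, the property can be placed in a job or site configuration file. The file name (mapred-site.xml) is an assumption about the cluster setup, not something stated above. On older Hadoop 1.x releases, which the mapred.tasktracker.* names in the question suggest, the same setting goes by the older name mapred.reduce.slowstart.completed.maps.

```xml
<!-- Assumed location: mapred-site.xml, or a per-job configuration file. -->
<property>
  <!-- Fraction of map tasks that must finish before reduce tasks are scheduled. -->
  <name>mapreduce.job.reduce.slowstart.completedmaps</name>
  <!-- 1.0 = wait for every map task, so no shuffle overlaps the map phase. -->
  <value>1.0</value>
</property>
```

This only separates the map phase from the shuffle. Within each reduce task the reduce function already runs only after that task's own shuffle and merge have finished, but different reduce tasks can still be in different phases at the same time.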
