Reputation: 1295
Does Hadoop also process replicas? For example, in the map phase, worker node i processes the data stored on that machine. After the original data (not the replica) has finished being processed in the map phase, or perhaps before it finishes, can there be a case where machine i processes replica data stored on that machine? Or is a replica used only when some node goes down?
Upvotes: 2
Views: 760
Reputation: 3173
Yes, replicas can also be processed, in a specific scenario called speculative execution.
If machine i takes too long to process the data block stored on it, the job's ApplicationMaster starts a duplicate, parallel mapper against another replica of that data block stored on a different machine. This speculative mapper runs on the machine j where the replica is stored.
Whichever mapper completes first has its output taken; the other, slower mapper is killed and its resources are freed.
Speculative execution is enabled by default. You can toggle it by modifying the properties below:
mapreduce.map.speculative
mapreduce.reduce.speculative
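As a sketch, these properties can be set cluster-wide in mapred-site.xml (the values shown here just illustrate disabling speculation for both phases; they can also be set per-job):

```xml
<!-- mapred-site.xml: disable speculative execution for map and reduce tasks -->
<property>
  <name>mapreduce.map.speculative</name>
  <value>false</value>
</property>
<property>
  <name>mapreduce.reduce.speculative</name>
  <value>false</value>
</property>
```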
In any case, no two replicas of the same data block are ever stored on the same machine; each replica of a block is kept on a different machine.
Upvotes: 4
Reputation: 2221
The master node (JobTracker) may or may not pick the original data; in fact, it does not track which of the three replicas is the "original". When the data is saved, a checksum verification is done on each block and it is stored cleanly, so all replicas are equivalent. When the JobTracker wants to pick a slot for a mapper, it takes many things into account: the number of free map slots, the overhead on each TaskTracker, and, last but not least, data locality. The closest node that satisfies most of these criteria is picked; whether it holds the original or a replica does not matter, since, as mentioned, that identity is not maintained anyway.
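To make the locality argument concrete, here is a toy sketch (not Hadoop's actual scheduler code; all names are illustrative) of picking a host for a map task by preferring node-local replicas, then rack-local ones, then any free host. Note that the replica list carries no notion of which copy is the "original":

```python
# Toy sketch of locality-based task placement (illustrative, not Hadoop's code).
def pick_host(block_replicas, free_slot_hosts, rack_of):
    """block_replicas: hosts holding a replica of the block.
    free_slot_hosts: hosts with a free map slot.
    rack_of: maps host name -> rack id."""
    # 1. node-local: a free host that itself holds a replica
    for h in free_slot_hosts:
        if h in block_replicas:
            return h, "node-local"
    # 2. rack-local: a free host on the same rack as some replica
    replica_racks = {rack_of[h] for h in block_replicas}
    for h in free_slot_hosts:
        if rack_of[h] in replica_racks:
            return h, "rack-local"
    # 3. otherwise any free host (off-rack read)
    return free_slot_hosts[0], "off-rack"

rack = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}
replicas = ["n1", "n3"]  # no replica is marked as the "original"
print(pick_host(replicas, ["n2", "n4"], rack))  # ('n2', 'rack-local')
```

The point of the sketch: the scheduler only asks "which free host is closest to *some* replica?", never "which replica is the original?".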
Upvotes: 2