Reputation: 1295
Does Hadoop also process replicas? For example, in the map phase, worker node i processes the data stored on that machine. After the original data (not the replica) has finished being processed in the map phase, or perhaps before it finishes, can there be a case where machine i processes replica data stored on that machine? Or is a replica used only when some node goes down?
Upvotes: 2
Views: 760
Reputation: 3173
Yes, replicas can also be processed, in a specific scenario called speculative execution.
If machine i takes too long to process the data block stored on it, the job's ApplicationMaster starts a duplicate, parallel mapper against another replica of that data block stored on a different machine. This speculative mapper runs on the machine j where the replica is stored.
Whichever mapper completes first has its output taken; the other, slower mapper is killed and its resources are freed.
Speculative execution is enabled by default. You can toggle it by modifying the properties below:
mapreduce.map.speculative
mapreduce.reduce.speculative
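As a sketch, these properties can be set cluster-wide in mapred-site.xml (the values shown here just illustrate disabling speculation for both phases; they can also be set per-job):

```xml
<!-- mapred-site.xml: disable speculative execution for map and reduce tasks -->
<property>
  <name>mapreduce.map.speculative</name>
  <value>false</value>
</property>
<property>
  <name>mapreduce.reduce.speculative</name>
  <value>false</value>
</property>
```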
In any case, no two replicas of the same data block are ever stored on the same machine; each replica of a block is kept on a different machine.
Upvotes: 4
Reputation: 2221
The master node (JobTracker) may or may not pick the original data; in fact, it does not track which of the three replicas is the "original". When the data is saved, a checksum verification is done on each block and it is stored cleanly, so all replicas are equivalent. When the JobTracker wants to pick a slot for a mapper, it takes many things into account: the number of free map slots, the overhead on each TaskTracker, and, last but not least, data locality. The closest node that satisfies most of these criteria is picked; whether it holds the original or a replica does not matter, since, as mentioned, that identity is not maintained anyway.
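To make the locality argument concrete, here is a toy sketch (not Hadoop's actual scheduler code; all names are illustrative) of picking a host for a map task by preferring node-local replicas, then rack-local ones, then any free host. Note that the replica list carries no notion of which copy is the "original":

```python
# Toy sketch of locality-based task placement (illustrative, not Hadoop's code).
def pick_host(block_replicas, free_slot_hosts, rack_of):
    """block_replicas: hosts holding a replica of the block.
    free_slot_hosts: hosts with a free map slot.
    rack_of: maps host name -> rack id."""
    # 1. node-local: a free host that itself holds a replica
    for h in free_slot_hosts:
        if h in block_replicas:
            return h, "node-local"
    # 2. rack-local: a free host on the same rack as some replica
    replica_racks = {rack_of[h] for h in block_replicas}
    for h in free_slot_hosts:
        if rack_of[h] in replica_racks:
            return h, "rack-local"
    # 3. otherwise any free host (off-rack read)
    return free_slot_hosts[0], "off-rack"

rack = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}
replicas = ["n1", "n3"]  # no replica is marked as the "original"
print(pick_host(replicas, ["n2", "n4"], rack))  # ('n2', 'rack-local')
```

The point of the sketch: the scheduler only asks "which free host is closest to *some* replica?", never "which replica is the original?".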
Upvotes: 2