Reputation: 55
Recently I was reading Google's paper, "MapReduce: Simplified Data Processing on Large Clusters". The words below confuse me. It says
When a map task is executed first by worker A and then later executed by worker B (because A failed), all workers executing reduce tasks are notified of the reexecution. Any reduce task that has not already read the data from worker A will read the data from worker B.
I guess the workers executing reduce tasks are just doing what they should do. If they have already read the data from worker A, they can continue their tasks. If they haven't, they fail and report the error to the master, and the master can re-assign the reduce task to others after worker B finishes. So why should they be notified of the re-execution immediately? It seems unnecessary for reducers that have already read the data they want from worker A.
Upvotes: 2
Views: 938
Reputation: 38290
So why should they be notified of the re-execution immediately? It seems unnecessary for reducers that have already read the data they want from worker A
The thing is that a reducer does not know whether it has already read all the data it wants from a mapper, because the mapper failed before it finished writing its output.
Reducers start reading data early, before the mapper completes, so they may have read only partial data; the mapper could have produced more had it not failed.
In other words, the mapper produced partial result files, then failed, and a new attempt started.
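To see why a reducer that already read from the failed worker still needs the notification, here is a minimal sketch (plain Python, not actual MapReduce code; the function names are made up for illustration). The mapper crashes after emitting only part of its pairs, so a reducer that kept worker A's data would silently work with an incomplete set:

```python
def run_mapper(records, fail_after=None):
    """Emit (word, 1) pairs; simulate a crash after `fail_after` records."""
    out = []
    for i, word in enumerate(records):
        if fail_after is not None and i >= fail_after:
            return out, False          # crashed: partial output only
        out.append((word, 1))
    return out, True                   # finished cleanly

def reduce_counts(pairs):
    """Word-count reduce over whatever pairs the reducer managed to read."""
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts

records = ["a", "b", "a", "c", "a"]

partial, ok = run_mapper(records, fail_after=3)   # worker A fails mid-write
full, _ = run_mapper(records)                     # worker B re-executes

print(reduce_counts(partial))   # {'a': 2, 'b': 1}            -- undercounts
print(reduce_counts(full))      # {'a': 3, 'b': 1, 'c': 1}    -- complete
```

The reducer cannot tell the difference between "I read everything A produced" and "I read everything A managed to produce before dying", which is why the master must tell it to re-read from worker B.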
Typically mappers and reducers are single-threaded and deterministic, which is what allows restarts and speculative execution. Suppose you do not use non-deterministic functions like rand() or multi-threading in the mapper (a custom non-deterministic mapper). The network/shuffle also adds non-determinism: a mapper using multiple cores/threads can produce differently ordered output after a restart. Mappers can consume the output of other mappers and even reducers (for example, a map-side join in modern implementations). The whole result must be deterministic to make restarts possible, but the order need not be; the same output can be grouped differently into a different number of files.
If the reducer is deterministic and commutative (typically yes), you can restart it and get the same result; because it is commutative, the order of the rows does not matter.
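As a tiny illustration (a sketch, not framework code): a commutative, deterministic reduce such as summation gives the same answer no matter what order a restarted run delivers the rows in.

```python
import random

def reduce_sum(rows):
    """A commutative, deterministic reduce: order of rows is irrelevant."""
    total = 0
    for v in rows:
        total += v
    return total

values = [3, 1, 4, 1, 5, 9, 2, 6]

shuffled = values[:]
random.shuffle(shuffled)   # a restart may deliver rows in a different order

assert reduce_sum(values) == reduce_sum(shuffled) == 31
```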
But is it possible to use partial results from one mapper instance (the failed one) together with partial results from another (the new attempt), e.g. read files 0000 - 0004 from Map1_attempt1 and files 0005 - 0006 from Map1_attempt2? Only if the mapper always produces exactly the same number of files in the same order. You see, even if the whole result of a mapper is deterministic, its partial results may not be. It depends on the implementation.
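The condition above can be sketched like this (a hypothetical file layout, not an actual framework API): mixing part-files from two attempts only reassembles the full output when every attempt splits the same records into the same files at the same boundaries.

```python
def attempt_files(records, chunk):
    """Deterministically split mapper output into fixed-size part files."""
    return [records[i:i + chunk] for i in range(0, len(records), chunk)]

records = list(range(10))

a = attempt_files(records, chunk=3)   # Map1_attempt1 (imagine it failed)
b = attempt_files(records, chunk=3)   # Map1_attempt2 (the re-run)

# Because the boundaries are deterministic, files 0-1 from attempt A plus
# files 2-3 from attempt B reassemble exactly the full output.
mixed = a[:2] + b[2:]
assert [x for f in mixed for x in f] == records

# If chunking depended on timing, buffer sizes, or thread scheduling, the
# same mix could drop or duplicate records.
```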
Upvotes: 1