In MapReduce, does replication also apply to intermediate data?

Question

In MapReduce, the output produced by mappers is called intermediate data.

Are intermediate data also replicated?

Are intermediate data temporary?

When will intermediate data get deleted? Is it deleted automatically or do we need to explicitly delete it?

Ram Ghadiyaram · Accepted Answer

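A mapper's spill files are stored on the local file system of the worker node where the mapper runs, not in HDFS. Likewise, map output that a reducer fetches from another node during the shuffle is stored on the local file system of the node where that reduce task runs. Because this data never goes into HDFS, it is not replicated; if it is lost, the framework re-runs the map task that produced it.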

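The local path is derived from the hadoop.tmp.dir property, which defaults to '/tmp/hadoop-${user.name}' (not plain '/tmp'). The directories actually used for intermediate data are set by mapreduce.cluster.local.dir (default '${hadoop.tmp.dir}/mapred/local') or, when running on YARN, by yarn.nodemanager.local-dirs (default '${hadoop.tmp.dir}/nm-local-dir').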
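To make the path derivation concrete, here is a minimal, standard-library-only sketch of how those nested defaults expand into an actual directory. It is a simplified stand-in for the variable expansion that Hadoop's Configuration class performs, not Hadoop code: the class name LocalDirResolver, its methods, and the lookup order (supplied properties first, then JVM system properties) are assumptions made for illustration, and the default values are the ones I understand the stock *-default.xml files to ship with.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: expands ${name} references the way the defaults
// for Hadoop's local directories are written. Not part of Hadoop.
final class LocalDirResolver {
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    // Assumed default values, as listed in core-default.xml,
    // mapred-default.xml and yarn-default.xml.
    static Map<String, String> defaults() {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("hadoop.tmp.dir", "/tmp/hadoop-${user.name}");
        props.put("mapreduce.cluster.local.dir", "${hadoop.tmp.dir}/mapred/local");
        props.put("yarn.nodemanager.local-dirs", "${hadoop.tmp.dir}/nm-local-dir");
        return props;
    }

    // Replace the first ${name} repeatedly until none remain.
    // Looks in props first, then in JVM system properties (e.g. user.name).
    // Unknown variables are left in place; depth is bounded to avoid loops.
    static String resolve(String value, Map<String, String> props) {
        String current = value;
        for (int depth = 0; depth < 20; depth++) {
            Matcher m = VAR.matcher(current);
            if (!m.find()) {
                return current;
            }
            String name = m.group(1);
            String replacement = props.containsKey(name)
                    ? props.get(name)
                    : System.getProperty(name);
            if (replacement == null) {
                return current;
            }
            current = current.substring(0, m.start())
                    + replacement
                    + current.substring(m.end());
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, String> props = defaults();
        // Print where intermediate data would land with untouched defaults.
        System.out.println(resolve("${mapreduce.cluster.local.dir}", props));
        System.out.println(resolve("${yarn.nodemanager.local-dirs}", props));
    }
}
```

Overriding hadoop.tmp.dir in the map moves both derived directories with it, which is why changing that one property is often enough to relocate intermediate data off '/tmp'.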

When the job completes or fails, the framework automatically clears the temporary location it used on the local file system, so you don't have to perform any cleanup yourself.
