mini

Reputation: 61

MapReduce intermediate data output location

You have just executed a MapReduce job. Where is intermediate data written to after being emitted from the Mapper’s map method?

Upvotes: 4

Views: 6863

Answers (3)

user2314737

Reputation: 29317

TaskTracker is a daemon responsible for spawning map and reduce tasks, and it usually resides on a DataNode. Map and reduce tasks buffer their records in memory until a certain threshold is reached; at that point the records are written to disk in the background (see Memory Management in Hadoop's MapReduce tutorial). Writing to disk once the threshold is reached is called a spill to disk. The threshold values are configurable; for the map side the relevant parameters are mapreduce.task.io.sort.mb and mapreduce.map.sort.spill.percent.
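To make the spill threshold concrete, here is a rough sketch of the arithmetic. It assumes the stock defaults of 100 for mapreduce.task.io.sort.mb and 0.80 for mapreduce.map.sort.spill.percent; the function name spill_threshold_bytes is made up for illustration and is not part of Hadoop. The real buffer also holds per-record metadata, so treat the result as an approximation.

```python
# Sketch only: estimate the buffer fill level at which a map task starts spilling.
# Assumed defaults: mapreduce.task.io.sort.mb = 100,
#                   mapreduce.map.sort.spill.percent = 0.80.
# spill_threshold_bytes is a hypothetical helper, not a Hadoop API.

def spill_threshold_bytes(io_sort_mb: int = 100, spill_percent: float = 0.80) -> int:
    """Return the approximate number of buffered bytes that triggers a background spill."""
    buffer_bytes = io_sort_mb * 1024 * 1024  # sort buffer size in bytes
    return round(buffer_bytes * spill_percent)


if __name__ == "__main__":
    print(spill_threshold_bytes())
```

With the assumed defaults this comes to 83886080 bytes, i.e. the spill starts when roughly 80 MiB of the 100 MiB buffer is in use.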

Answer A is off because intermediate data may be written to disk.

Answers B and E can be excluded because spilled intermediate data isn't written to HDFS but to the local filesystem.

Finally, D is wrong because the question asks about the intermediate data emitted by the Mapper’s map method. Also, it's not necessary to specify "outside HDFS", because in the Hadoop context the local filesystem is always understood to be non-HDFS.

So, the correct answer is C.

Upvotes: 3

avinash

Reputation: 147

The mapper output is stored on the local filesystem (not HDFS) of the TaskTracker node, so the answer is option "C".

Upvotes: 1

Bhuvan

Reputation: 104

The mapper output (intermediate data) is stored on the local filesystem (not HDFS) of each individual mapper node. This is typically a temporary directory whose location can be set in the configuration by the Hadoop administrator. The intermediate data is cleaned up after the Hadoop job completes.

I think this is the parameter that has to be modified to change the intermediate data location:

mapreduce.cluster.local.dir
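As a rough illustration, the property takes a comma-separated list of local directories. The sketch below builds the corresponding property element for mapred-site.xml; the helper name local_dir_property and the example paths are made up, so substitute the directories that exist on your own nodes.

```python
# Sketch only: build the <property> element for mapreduce.cluster.local.dir.
# local_dir_property and the example paths are hypothetical, for illustration.
import xml.etree.ElementTree as ET


def local_dir_property(dirs):
    """Return a mapred-site.xml <property> element (as a string) listing local dirs."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = "mapreduce.cluster.local.dir"
    # Multiple directories are joined with commas.
    ET.SubElement(prop, "value").text = ",".join(dirs)
    return ET.tostring(prop, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical mount points on a worker node.
    print(local_dir_property(["/data/1/mapred/local", "/data/2/mapred/local"]))
```

The printed element would go inside the configuration element of mapred-site.xml on each worker node.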

Upvotes: 1
