Reputation: 61
You have just executed a MapReduce job. Where is intermediate data written to after being emitted from the Mapper’s map method?
Upvotes: 4
Views: 6863
Reputation: 29317
The TaskTracker is a daemon responsible for spawning map and reduce workers; it usually resides on a DataNode. Map output records are collected in an in-memory buffer until a certain threshold is reached; at that point records are written to disk in the background (see Memory Management in Hadoop's MapReduce tutorial). Writing to disk once the threshold capacity is reached is also called a spill to disk. The thresholds are controlled by configurable parameters, e.g. mapreduce.task.io.sort.mb and mapreduce.map.sort.spill.percent on the map side.
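A minimal mapred-site.xml sketch showing how these spill thresholds can be tuned (the values below are illustrative, not the defaults):

```xml
<configuration>
  <!-- Size, in MB, of the in-memory buffer that holds map output before spilling -->
  <property>
    <name>mapreduce.task.io.sort.mb</name>
    <value>256</value>
  </property>
  <!-- Fraction of the buffer that, once filled, triggers a background spill to disk -->
  <property>
    <name>mapreduce.map.sort.spill.percent</name>
    <value>0.80</value>
  </property>
</configuration>
```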
Answer A can be ruled out because intermediate data may be written to disk, not kept only in memory.
Answers B and E can be excluded because spilled intermediate data isn't written to HDFS but to the local filesystem.
Finally, D is wrong because the question asks about the intermediate data of the Mapper’s map method. Also, it isn't necessary to specify "outside HDFS": in a Hadoop context, "local filesystem" is always understood to mean non-HDFS.
So, the correct answer is C.
Upvotes: 3
Reputation: 147
The mapper output is stored on the local filesystem (not HDFS) of the TaskTracker node, so the answer is option "C".
Upvotes: 1
Reputation: 104
The mapper output (intermediate data) is stored on the local filesystem (NOT HDFS) of each individual mapper node. This is typically a temporary directory location that the Hadoop administrator can set up in the configuration. The intermediate data is cleaned up after the Hadoop job completes.
I think this is the parameter that has to be modified to change the intermediate data location:
mapreduce.cluster.local.dir
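For example, a hedged mapred-site.xml fragment (the directory paths are illustrative assumptions, not defaults):

```xml
<configuration>
  <!-- Comma-separated list of local directories where intermediate
       MapReduce data is written; spreading them across disks helps I/O -->
  <property>
    <name>mapreduce.cluster.local.dir</name>
    <value>/data1/mapred/local,/data2/mapred/local</value>
  </property>
</configuration>
```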
Upvotes: 1