Reputation: 326
As far as I know, mapper output is stored on the node where the mapper was executed.
So, when I am processing 1 TB of data, let's say the total number of mappers is 1000. At first, it executes 500 mappers, stores their output locally, and starts executing the remaining mappers. After that, it shuffles the data to the reducers and starts the reduce phase.
Question:
Will a data node store all the mapper output that gets produced on that node? If so, will it store 1 TB, or 0.75 TB (after compression), locally before sending the data to the reducers?
Upvotes: 0
Views: 1769
Reputation: 7462
I am not sure if I got your question correctly (please rephrase), but I guess you are asking what happens when the output of a mapper is too big to fit on its local disk (yes, it is stored locally, not on HDFS). See this related post and this one. Actually, it is first written to an in-memory buffer, and when this buffer fills up, it is spilled to disk. I also found this document, which explains the process in a nice and intuitive way.
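As a rough sketch (the property names below are the standard Hadoop 2.x ones; the values are only illustrative, not a recommendation), the buffer size, the spill threshold, and map-output compression, which is what your 0.75 TB figure would depend on, can all be set in the job configuration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpillTuningSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Size (MB) of the in-memory buffer that collects map output
        // before it is spilled to the node's local disk.
        conf.setInt("mapreduce.task.io.sort.mb", 256);
        // Fraction of that buffer that may fill up before a background
        // spill to disk is started.
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);
        // Compress the intermediate (spilled) map output, reducing both
        // local disk usage and the amount of data shuffled to reducers.
        conf.setBoolean("mapreduce.map.output.compress", true);

        Job job = Job.getInstance(conf, "spill-tuning-sketch");
        // ... set mapper, reducer, input and output paths as usual ...
    }
}
```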
If the output is larger than what fits on the node's local disk, the task will fail with a "No space left on device" error, and Hadoop will try to run it on another node. If the second attempt also fails, it will be sent to yet another node, until a predefined number n of task attempts have failed.
Then, if a number m of tasks have failed, your job will fail as well.
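Both of those limits are configurable. As a hedged sketch (these are the Hadoop 2.x property names with their stock defaults), n and m correspond roughly to the following settings:

```java
import org.apache.hadoop.conf.Configuration;

public class FailureLimitsSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // n: how many times a single map task is attempted before it is
        // given up on (default 4).
        conf.setInt("mapreduce.map.maxattempts", 4);
        // m, expressed as a percentage: how many map tasks may fail
        // before the whole job is marked as failed (default 0, i.e. a
        // single permanently failed task fails the job).
        conf.setInt("mapreduce.map.failures.maxpercent", 0);
        System.out.println("map attempts limit = "
                + conf.getInt("mapreduce.map.maxattempts", 4));
    }
}
```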
However, I am not sure why you imply that the whole input (1 TB) will be processed by one node. Usually, it is split into many chunks that are processed by different nodes (unless you only have one node in your cluster). For example, with the default 128 MB HDFS block size, a 1 TB input is divided into roughly 1 TB / 128 MB = 8192 splits, i.e., about 8192 map tasks spread across the cluster.
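If you want to influence how many mappers you get, a minimal sketch (the input path here is hypothetical) is to bound the split size via FileInputFormat:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        // Hypothetical input path; each input split becomes one map task.
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        // Bound the split size (in bytes) to control the mapper count:
        // larger splits -> fewer mappers, smaller splits -> more mappers.
        FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
    }
}
```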
Upvotes: 1