Neha Sharma

Reputation: 295

When does a mapper store its output to its local hard disk?

I know that

The output of the Mapper (intermediate data) is stored on the local file system (not HDFS) of each individual mapper node. This is typically a temporary directory that can be set in the config by the Hadoop administrator. Once the mapper job has completed, or the data has been transferred to the reducer, this intermediate data is cleaned up and is no longer accessible.

But I wanted to know: when does a mapper store its output to its local hard disk? Is it because the data is too large to fit in memory, so only the data currently being processed remains in memory? If the data is small enough that all of it fits in memory, is there no disk involvement?

Can we not move the data directly from the mapper to the reducer, once it is processed, without involving the mapper machine's hard disk? I mean: as each chunk of data is processed in the mapper's memory, it is transferred straight to the reducer once computed, and the mapper moves on to the next chunk in the same way, with no disk involvement.

In Spark, it is said there is in-memory computation. How is that different from the above? What makes Spark's in-memory computation better than MapReduce? Also, would Spark have to involve the disk if the data is too huge?

Please explain

Upvotes: 3

Views: 2661

Answers (1)

Preeti Khurana

Reputation: 370

There are a lot of questions here. I shall try to explain each one.

when does a mapper store its output to its local hard disk?

The mapper stores its output in a configured in-memory buffer. When the buffer is 80% full (again, configurable), it runs the combiner on the data present in memory to reduce it. When the combined data still surpasses the memory limit, it is spilled to disk; the resulting files are known as spill files. Over the course of the whole map operation, multiple spill files are written. While writing a spill file, the mapper sorts the data and partitions it according to the target reducer. At the end of the map task, these spill files need to be merged.
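The sort-and-spill behavior can be sketched as a toy simulation (plain Python, not Hadoop's actual implementation; the buffer capacity and 0.8 threshold are stand-ins for the real `mapreduce.task.io.sort.mb` and `mapreduce.map.sort.spill.percent` settings, and the record values are made up):

```python
# Simplified sketch of a mapper's sort-and-spill buffer (not Hadoop code).
# Records accumulate in memory; at 80% of capacity they are partitioned
# by target reducer, sorted by key, and "spilled" as one sorted run.

BUFFER_CAPACITY = 10    # stand-in for mapreduce.task.io.sort.mb
SPILL_THRESHOLD = 0.8   # stand-in for mapreduce.map.sort.spill.percent
NUM_REDUCERS = 2

spill_files = []        # each spill: {partition: sorted [(key, value), ...]}
buffer = []

def partition(key):
    # Same idea as Hadoop's default HashPartitioner.
    return hash(key) % NUM_REDUCERS

def spill(buf):
    run = {p: [] for p in range(NUM_REDUCERS)}
    for key, value in buf:
        run[partition(key)].append((key, value))
    for p in run:
        run[p].sort()   # sorted by key within each partition
    spill_files.append(run)

def collect(key, value):
    # Called once per record emitted by the map function.
    buffer.append((key, value))
    if len(buffer) >= BUFFER_CAPACITY * SPILL_THRESHOLD:
        spill(buffer)
        buffer.clear()

# Emit 20 records; they do not all fit in the buffer at once.
for i in range(20):
    collect(f"word{i % 5}", 1)
if buffer:              # final spill at the end of the map task
    spill(buffer)

print(len(spill_files))  # 3 sorted, partitioned spill runs
```

With 20 records and a spill threshold of 8, the buffer spills twice mid-task and once at the end, leaving three sorted runs to be merged, which mirrors the multiple spill files described above.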

Can we not move the data directly from the mapper to the reducer, once it is processed, without involving the hard disk of the mapper machine?

The costliest operation in any processing is data transfer between machines. The whole paradigm of MapReduce is to take the processing near the data rather than moving the data around. If it were done the way you suggest, there would be a lot of data movement. It is faster to write to the local disk than to write over the network, and the spilled data can be shrunk by merging the spill files. Sorting is done while spilling because it is easier (faster) to merge sorted data. Partitioning is done because you only need to merge the same partitions (the data going to the same reducer). During the merge, combiners are run again to reduce the data further. This reduced data is then sent to the reducers.
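The merge-plus-combine step can be illustrated with a small sketch (plain Python, using `heapq.merge` to stream already-sorted runs together, which is why the mapper sorts before spilling; the spill contents are made-up word-count records for one partition):

```python
import heapq
from itertools import groupby

# Sorted runs for one partition, as written by three hypothetical spills.
spill_runs = [
    [("apple", 1), ("cat", 1), ("cat", 1)],
    [("apple", 1), ("dog", 1)],
    [("cat", 1), ("dog", 1), ("dog", 1)],
]

# Merging sorted runs is cheap: heapq.merge streams them in key order
# without re-sorting anything.
merged = heapq.merge(*spill_runs)

# Running a combiner during the merge: sum the values per key (like a
# word-count combiner), so less data is sent over the network.
combined = [(key, sum(v for _, v in group))
            for key, group in groupby(merged, key=lambda kv: kv[0])]

print(combined)  # [('apple', 2), ('cat', 3), ('dog', 3)]
```

Three runs totalling eight records collapse to three key/value pairs, which is the "reduced data" that actually travels to the reducer.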

In Spark, it is said there is in-memory computation. How is that different from the above?

There is no difference between a Spark program and a MapReduce program when you just read some data set, perform one map function, and one reduce function. Spark will do the same disk reads and writes as the MapReduce code.

The difference comes when you need to run several operations on the same data set. MapReduce will read the data from disk for each operation, but in Spark you have the option to cache it in memory, in which case it is read from disk only once and the later operations run on the data stored in memory, which is obviously much faster.

The same applies to chains of operations, where the output of the first operation is the input to the second. In MapReduce, the output of the first operation is written to disk and then read back from disk by the second, whereas in Spark you can persist the output of the first operation in memory, so the second operation reads from memory and is faster.

Upvotes: 6
