Amount of space used in the mapping phase

Question

I'm new to Hadoop and I started wondering: how much disk space do the results of the map phase take up? I'm referring to the map output, which is the input of the reduce phase.

Does it depend on the algorithm performed? On the Hadoop setup and configuration? On the number of nodes?

vefthym · Accepted Answer

Does it depend on the algorithm performed?

Definitely yes. Imagine one map function that emits (a, b) and another that emits both (a, b) and (b, a) for the same input. The second one emits twice as much data as the first one.
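To make the (a, b) versus (a, b) + (b, a) comparison concrete, here is a minimal plain-Java sketch. It does not use the Hadoop API; the class and method names (MapEmissionSketch, emitOnce, emitBothDirections, payloadBytes) are hypothetical, and the size is only the UTF-8 payload of keys and values, ignoring any serialization overhead Hadoop adds.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class MapEmissionSketch {
    // Hypothetical map function 1: emits (a, b) once per input pair.
    static List<String[]> emitOnce(String a, String b) {
        List<String[]> out = new ArrayList<>();
        out.add(new String[] {a, b});
        return out;
    }

    // Hypothetical map function 2: emits (a, b) and also (b, a).
    static List<String[]> emitBothDirections(String a, String b) {
        List<String[]> out = new ArrayList<>();
        out.add(new String[] {a, b});
        out.add(new String[] {b, a});
        return out;
    }

    // Rough payload size: UTF-8 bytes of every key and value (an assumption;
    // real map output also carries length prefixes and other overhead).
    static long payloadBytes(List<String[]> pairs) {
        long total = 0;
        for (String[] pair : pairs) {
            total += pair[0].getBytes(StandardCharsets.UTF_8).length;
            total += pair[1].getBytes(StandardCharsets.UTF_8).length;
        }
        return total;
    }

    public static void main(String[] args) {
        String[][] input = {{"alice", "bob"}, {"carol", "dave"}};
        long once = 0;
        long both = 0;
        for (String[] record : input) {
            once += payloadBytes(emitOnce(record[0], record[1]));
            both += payloadBytes(emitBothDirections(record[0], record[1]));
        }
        System.out.println("emit once:            " + once + " bytes");
        System.out.println("emit both directions: " + both + " bytes");
    }
}
```

The same input produces twice the payload under the second function, so the algorithm alone can double the intermediate data.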

On the Hadoop setup and configuration?

Yes. You can configure Hadoop to compress the map output (conf.setBoolean("mapreduce.map.output.compress", true)). Furthermore, you can choose among different compression codecs, like gzip, bzip2 and others. More details on choosing the correct compression option can be found here.
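As a rough illustration of why compression matters for intermediate data, the sketch below gzips a buffer of repetitive, tab-separated records with java.util.zip. This is not Hadoop's codec machinery; the record layout and the names (MapOutputCompressionSketch, sampleMapOutput, gzip) are made up for the example, and real savings depend entirely on your data.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

class MapOutputCompressionSketch {
    // Builds a hypothetical word-count style map output: "word<N>\t1\n" lines.
    static byte[] sampleMapOutput(int records) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < records; i++) {
            sb.append("word").append(i % 50).append('\t').append(1).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Compresses a byte array with gzip, entirely in memory.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = sampleMapOutput(10_000);
        byte[] compressed = gzip(raw);
        System.out.println("raw bytes:  " + raw.length);
        System.out.println("gzip bytes: " + compressed.length);
    }
}
```

Map output with many repeated keys tends to compress well, at the cost of extra CPU time on the map and reduce sides.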

Furthermore, Hadoop offers some variable-length types, like VIntWritable for integers, that can save a lot of space. Variable-length types take only as many bytes as required to store their values, e.g. small numbers take fewer bytes than large numbers when stored as VIntWritables.
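To get a feeling for the savings, here is a small plain-Java sketch of the size rule as I understand it from Hadoop's variable-length integer format: one byte for values between -112 and 127, otherwise one length byte plus as many bytes as the magnitude needs. Treat the exact byte counts as an assumption and check them against your Hadoop version; the class and method names are hypothetical.

```java
class VIntSizeSketch {
    // Bytes a value would occupy under the variable-length rule assumed here.
    static int variableLengthSize(int value) {
        if (value >= -112 && value <= 127) {
            return 1; // small values fit in a single byte
        }
        long v = value;
        if (v < 0) {
            v = ~v; // one's complement, so negatives are sized by magnitude
        }
        int dataBits = Long.SIZE - Long.numberOfLeadingZeros(v);
        return (dataBits + 7) / 8 + 1; // data bytes plus one length byte
    }

    public static void main(String[] args) {
        int[] samples = {5, 127, 128, 70_000, Integer.MAX_VALUE};
        for (int sample : samples) {
            System.out.println(sample + " -> " + variableLengthSize(sample)
                    + " byte(s) variable-length, " + Integer.BYTES + " bytes fixed");
        }
    }
}
```

Note the trade-off: under this rule the largest values need one byte more than a fixed 4-byte int, so variable-length types pay off when most values are small.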

On the number of nodes?

Here, I would say no (I am not sure, but I can't think of how this would affect the output size). It does depend, however, on the number of mappers and, furthermore, on the size of the input data. Imagine, for example, that for each input key you emit as many (key, value) pairs as there are mappers. With larger data you will possibly have more mappers, and therefore more output per key. Or, more simply, imagine that you emit one (key, value) pair for every input key of the mapper. More data -> larger output.
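The arithmetic behind this can be sketched as follows. It assumes one mapper per input split and a 128 MB split size, which is only a common default; the real number of mappers depends on the input format and file layout. All names here are hypothetical.

```java
class MapOutputScalingSketch {
    // Rough mapper count: one mapper per split, rounded up (an assumption).
    static long estimateMappers(long inputBytes, long splitBytes) {
        return (inputBytes + splitBytes - 1) / splitBytes; // ceiling division
    }

    // First case from the answer: each input key emits one pair per mapper.
    static long pairsOnePerMapper(long inputKeys, long mappers) {
        return inputKeys * mappers;
    }

    // Second case from the answer: each input key emits exactly one pair.
    static long pairsOnePerKey(long inputKeys) {
        return inputKeys;
    }

    public static void main(String[] args) {
        long splitBytes = 128L * 1024 * 1024; // assumed split size
        long[] inputSizes = {1L << 30, 10L << 30}; // 1 GiB and 10 GiB
        long keysPerGiB = 1_000_000; // made-up key density for illustration
        for (long inputBytes : inputSizes) {
            long mappers = estimateMappers(inputBytes, splitBytes);
            long keys = keysPerGiB * (inputBytes >> 30);
            System.out.println("input GiB: " + (inputBytes >> 30)
                    + ", mappers: " + mappers
                    + ", pairs (one per key): " + pairsOnePerKey(keys)
                    + ", pairs (one per key per mapper): " + pairsOnePerMapper(keys, mappers));
        }
    }
}
```

In the first case the output grows faster than the input, because both the number of keys and the number of mappers increase with the data size.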
