How output files (part-m-0001/part-r-0001) are created in MapReduce

Question

I understand that MapReduce output is stored in files named like part-r-* for reducers and part-m-* for mappers.

When I run a MapReduce job, sometimes I get the whole output in a single file (around 150 MB), and sometimes, for almost the same data size, I get two output files (one 100 MB and the other 50 MB). This seems very random to me. I can't find any reason for this.

I want to know how it is decided whether the data goes into a single output file or multiple output files, and whether there is any way to control it.

Thanks

Evgeny Benediktov · Accepted Answer

Contrary to what the answer by Jijo here says, the number of output files depends on the number of Reducers/Mappers.

It has nothing to do with the number of physical nodes in the cluster.

The rule is: one part-r-* file for one Reducer. The number of Reducers is set by job.setNumReduceTasks();
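To make the rule concrete, here is a minimal plain-Java sketch (not Hadoop code) of how keys end up in part-r-* files. The class and method names are hypothetical. The partition formula mirrors what Hadoop's default HashPartitioner does, and the five-digit file suffix follows the usual part-r-00000 naming. Treat it as an illustration under those assumptions, not as the framework's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: N reduce tasks lead to N part-r-* files.
class ReducerOutputSketch {

    // Same arithmetic as Hadoop's default HashPartitioner:
    // clear the sign bit, then take the remainder by the reducer count.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Output file name written by the reduce task with this partition number.
    static String partFileName(int partition) {
        return String.format("part-r-%05d", partition);
    }

    // One file name per reduce task, regardless of how many nodes exist.
    static List<String> outputFiles(int numReduceTasks) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            names.add(partFileName(i));
        }
        return names;
    }

    public static void main(String[] args) {
        int numReduceTasks = 2; // the value you would pass to job.setNumReduceTasks()
        for (String key : new String[] {"apple", "banana", "cherry"}) {
            int p = partitionFor(key, numReduceTasks);
            System.out.println(key + " -> " + partFileName(p));
        }
        System.out.println(outputFiles(numReduceTasks));
    }
}
```

Because every record with the same key goes to the same partition, the sizes of the part-r-* files depend on how the keys hash, so they are generally not equal.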

If your job has no Reducers, there is one part-m-* file per Mapper. There is one Mapper per InputSplit, and unless you use a custom InputFormat implementation, there is usually one InputSplit per HDFS block of your input data.
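For a map-only job, a rough way to estimate the number of part-m-* files is to count the input splits. The sketch below is plain Java with hypothetical names. It assumes a single splittable input file and a split size equal to the HDFS block size. It also assumes the 10% slop that, as far as I know, FileInputFormat applies, so that a small tail is folded into the last split. The 128 MB and 64 MB block sizes are only example values.

```java
// Hypothetical sketch: estimate how many InputSplits (and so Mappers) one file yields.
class InputSplitSketch {

    // Assumed slop factor: a tail smaller than 10% of a split joins the last split.
    static final double SPLIT_SLOP = 1.1;

    static int countSplits(long fileBytes, long splitBytes) {
        int splits = 0;
        long remaining = fileBytes;
        // Carve off full-size splits while more than 1.1 splits' worth of data remains.
        while ((double) remaining / splitBytes > SPLIT_SLOP) {
            splits++;
            remaining -= splitBytes;
        }
        // Whatever is left (if anything) becomes the final split.
        if (remaining > 0) {
            splits++;
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // 150 MB of input with an assumed 128 MB block size
        System.out.println(countSplits(150 * mb, 128 * mb));
        // The same input with an assumed 64 MB block size
        System.out.println(countSplits(150 * mb, 64 * mb));
    }
}
```

Under these assumptions, a file just under the threshold stays in one split while a slightly larger one is cut in two, which is one possible explanation for seeing one output file on some runs and two on others with similar data sizes.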
