sunil

Reputation: 1279

How are output files (part-m-0001/part-r-0001) created in MapReduce?

I understand that MapReduce output is stored in files named part-r-* for reducers and part-m-* for mappers.

When I run a MapReduce job, sometimes I get the whole output in a single file (around 150 MB), and sometimes, for almost the same data size, I get two output files (one 100 MB and the other 50 MB). This seems very random to me, and I can't find any reason for it.

I want to know how it is decided whether the data goes into a single output file or multiple output files, and whether there is any way to control it.

Thanks

Upvotes: 0

Views: 4031

Answers (2)

nirnir

Reputation: 41

The number of part-m-* and part-r-* output files is determined by the number of map tasks and the number of reduce tasks, respectively.

Upvotes: 1

Evgeny Benediktov

Reputation: 1399

Contrary to what the answer by Jijo here says, the number of files depends on the number of Reducers/Mappers.

It has nothing to do with the number of physical nodes in the cluster.

The rule is: one part-r-* file per Reducer. The number of Reducers is set by job.setNumReduceTasks().

If there are no Reducers in your job, then you get one part-m-* file per Mapper. There is one Mapper per InputSplit, and unless you use a custom InputFormat implementation, there is usually one InputSplit per HDFS block of your input data.
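To make the rule above concrete, here is a small plain-Java sketch that only models it; it does not call Hadoop at all. The class and method names (PartFileSketch, approxInputSplits, expectedOutputFiles) are made up for illustration, the 128 MB block size is an assumed value, and the split estimate uses the simplified "one InputSplit per HDFS block" rule, so the real split count on a cluster may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper that models the rule from this answer; not part of Hadoop.
public class PartFileSketch {

    // Simplified assumption: one InputSplit (and so one Mapper) per HDFS block.
    public static int approxInputSplits(long fileBytes, long blockBytes) {
        if (fileBytes <= 0) {
            return 0;
        }
        // Ceiling division: a partially filled last block still counts as a split.
        return (int) ((fileBytes + blockBytes - 1) / blockBytes);
    }

    // One part-r-* file per Reducer; with zero Reducers, one part-m-* file per Mapper.
    public static List<String> expectedOutputFiles(int numReducers, int numMappers) {
        boolean mapOnly = (numReducers == 0);
        int count = mapOnly ? numMappers : numReducers;
        String kind = mapOnly ? "m" : "r";
        List<String> names = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            // Part files are numbered from 0 with five zero-padded digits.
            names.add(String.format(Locale.ROOT, "part-%s-%05d", kind, i));
        }
        return names;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Assumed values: a 150 MB input file and a 128 MB HDFS block size.
        int mappers = approxInputSplits(150 * mb, 128 * mb);

        System.out.println("1 reducer:  " + expectedOutputFiles(1, mappers));
        System.out.println("2 reducers: " + expectedOutputFiles(2, mappers));
        System.out.println("map-only:   " + expectedOutputFiles(0, mappers));
    }
}
```

In a real job the count you pass as numReducers is whatever you gave to job.setNumReduceTasks(), so fixing that value in the driver is the way to get a predictable number of part-r-* files.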

Upvotes: 4
