Reputation: 1279
I understand that the map reduce output are stored in files named like part-r-* for reducer and part-m-* for mapper.
When I run a mapreduce job sometimes a get the whole output in a single file(size around 150MB), and sometimes for almost same data size I get two output files(one 100mb and other 50mb). This seems very random to me. I cant find out any reason for this.
I want to know how its decided to put that data in a single or multiple output files. and if any way we can control it.
Thanks
Upvotes: 0
Views: 4031
Reputation: 41
The number of output files part-m-* and part-r-* is set according to the number of map tasks and the number of reduce tasks respectively.
Upvotes: 1
Reputation: 1399
Unlike specified in the answer by Jijo here - the number of the files depends on on the number of Reducers/Mappers.
It has nothing to do with the number of physical nodes in the cluster.
The rule is: one part-r-* file for one Reducer. The number of Reducers is set by job.setNumReduceTasks();
If there are no Reducers in your job - then one part-m-* file for one Mapper. There is one Mapper for one InputSplit (usually - unless you use custom InputFormat implementation, there is one InputSplit for one HDFS block of your input data).
Upvotes: 4