Reputation: 803
I have a Hadoop job that is map-only (no reduce phase), so I set job.setNumReduceTasks(0). There are 300+ input files.
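For reference, the driver is configured roughly like this (simplified sketch; the class names and paths are placeholders, and MyMapper is the actual mapper class):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MapOnlyDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "map-only-job");
            job.setJarByClass(MapOnlyDriver.class);
            job.setMapperClass(MyMapper.class);   // placeholder mapper class
            job.setNumReduceTasks(0);             // map-only: no reduce phase
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path("/input"));   // directory with 300+ files
            FileOutputFormat.setOutputPath(job, new Path("/output"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }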
When I run the job, I see only 1 map task running, and it takes about an hour to finish. When I check the result, there are 300+ result files in the output folder.
Is something wrong here, or is this the expected behavior?
I expected the number of map tasks to equal the number of input files (not 1). I also don't understand why the number of output files matches the number of input files.
The Hadoop job is submitted from Oozie.
Thank you very much for your kind help. Xinsong
Upvotes: 0
Views: 90
Reputation: 936
The number of mappers is controlled by the number of InputSplits. If you are using the default FileInputFormat, it creates at least one InputSplit per file (more if a file spans multiple HDFS blocks), so with 300+ input files it is expected to run 300+ map tasks. You cannot set the number of mappers directly.
Since the number of reducers is set to 0, each mapper's output is written directly to the output directory using the configured OutputFormat; that is why you are getting 300+ output files (one per map task).
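You can, however, influence the mapper count through the input format. If fewer map tasks are wanted for many small files, one common option is CombineTextInputFormat, which packs several files into one split (a sketch, assuming Hadoop 2.x and plain-text input; the 128 MB cap is an example value):

    // org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat
    job.setInputFormatClass(CombineTextInputFormat.class);
    // Merge small files into splits of up to ~128 MB each (example value),
    // so the number of map tasks drops well below the number of files.
    CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);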
Upvotes: 1
Reputation: 750
When you set the number of reducers to 0, the output that is generated corresponds to the map tasks alone.
The large number of output files corresponds to the splits of your data: each split spawns its own map task, and each map task writes its own output file.
Going by the execution time, I assume your files are fairly large, so it is perfectly normal for a large number of output files to be generated.
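For illustration, a minimal pass-through mapper for such a map-only job (a sketch; the class name is a placeholder). With zero reducers, each map task writes its own part-m-NNNNN file, which is why 300+ splits produce 300+ output files:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class PassThroughMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // With 0 reducers, this write goes straight to this task's own
            // output file (part-m-00000, part-m-00001, ...), bypassing the
            // shuffle and sort entirely.
            context.write(NullWritable.get(), line);
        }
    }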
Upvotes: 1