Reputation: 11629
I have some questions about MapReduce output part files.
Upvotes: 0
Views: 4776
Reputation:
part-00000 is an output file written by a mapper or reducer task in the old API. In the new API the naming changed slightly: part-m-* for mapper output and part-r-* for reducer output. For more details, see Hadoop: The Definitive Guide (O'Reilly), page 28.
Upvotes: 0
Reputation: 670
In old versions (< 0.20), jobs wrote only part-000* files. Now we see both part-m-n* and part-r-n* files, where n is a number (for example, part-m-00000). part-r-n* is the output from the reducers. part-m-n* is the output from combiners. (If I don't use a combiner, I don't get any part-m-n* files. I am not sure whether that is the default behaviour.)
Upvotes: 1
Reputation: 5239
Normally, part-r-* comes from the reducer. MultipleOutputs allows you to use a different naming convention. If there is no reduce step, the output will be part-m-*. As I understand it, if a reducer is defined, the mapper outputs are deleted regardless of whether the reducers produce anything. Usually the reducer output files are produced even if they are empty, unless you use LazyOutputFormat. Where did you find part-* files that did not end with either m-nnnnn or r-nnnnn?
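If it helps to check what is actually sitting in an output directory, here is a minimal sketch that sorts file names into the three default conventions mentioned in this thread. The class `PartFileNames` and its `classify` method are hypothetical helpers written for illustration, not part of Hadoop. The pattern assumes at least five digits and an optional extension such as `.gz`, and it does not cover custom names produced through MultipleOutputs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Hadoop): classifies an output file name
// by the default naming conventions discussed in this thread.
class PartFileNames {
    // Matches part-00000 (old API), part-m-00000 (map output) and
    // part-r-00000 (reduce output), with an optional extension such as .gz.
    private static final Pattern PART =
        Pattern.compile("part-(?:([mr])-)?\\d{5,}(?:\\..+)?");

    static String classify(String fileName) {
        Matcher m = PART.matcher(fileName);
        if (!m.matches()) {
            return "not a default part file";
        }
        String kind = m.group(1); // null when there is no m- or r- infix
        if (kind == null) {
            return "old API output";
        }
        return kind.equals("m") ? "map output" : "reduce output";
    }

    public static void main(String[] args) {
        String[] names = {"part-00000", "part-m-00000", "part-r-00003.gz", "_SUCCESS"};
        for (String name : names) {
            System.out.println(name + " -> " + classify(name));
        }
    }
}
```

Running it over the names in a job's output directory shows at a glance whether the files came from a map-only job, a job with reducers, or the old API.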
Upvotes: 2