Reputation: 89
Is it possible to split the output of a mapreduce job into multiple files instead of a single 'part-r-00000' file?
I've come across the MultipleOutputFormat class, but from what I've read it only splits the output into separate files based on the key.
What I'm looking for, taking the WordCount job as an example, is simply to divide the output into more than one file.
Upvotes: 1
Views: 3074
Reputation: 3433
I had a similar problem with WordCount. In my case I needed to write words starting with each letter into separate files. Here I used MultipleOutputs.
public class NameCountReducer extends Reducer<Text, NameCountTuple, Text, NameCountTuple> {
    private NameCountTuple result = new NameCountTuple();
    private MultipleOutputs<Text, NameCountTuple> out;

    @Override
    public void setup(Context context) {
        out = new MultipleOutputs<Text, NameCountTuple>(context);
    }

    @Override
    public void reduce(Text key, Iterable<NameCountTuple> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        for (NameCountTuple val : values) {
            count += val.getCount();
        }
        result.setCount(count);
        // The third argument is the base output path, relative to the job's output directory
        out.write(key, result, "outputpath/" + key.toString().charAt(0));
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        out.close();
    }
}
This gives output under paths like:
outputpath/a
outputpath/b
outputpath/c
.......
For this you should use LazyOutputFormat.setOutputFormatClass() instead of FileOutputFormat, so that output files are only created when something is actually written to them. You also need to set the job's output format with job.setOutputFormatClass(NullOutputFormat.class) to suppress the default part-r files.
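A minimal driver sketch wiring this up could look like the following. This is an assumption based on the reducer above, not code from the original answer: NameCountTuple and NameCountReducer are the classes defined there, and the mapper/input settings would need to match your own job.

```java
// Hypothetical driver for the NameCountReducer above (a sketch, not the
// poster's actual code). Adjust mapper, input format, and types to your job.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class NameCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(NameCountDriver.class);
        job.setReducerClass(NameCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NameCountTuple.class);

        // Delay file creation until the first record is written, so the
        // default empty part-r files are not produced alongside the
        // MultipleOutputs files.
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```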
Upvotes: 3
Reputation: 89
Thank you all for the above suggestions.
The MapReduce job I have is actually just a simple search job: the map tasks extract the input lines that match a certain condition, and the result is output directly without going through any reduce tasks.
Initially I did not set the number of reduce tasks, and from the output logs I could see that it defaults to 1. When I tried setting a higher number, it did produce multiple output files (part-000xx), but only one of them carried all the results while the rest were empty.
Then when I set the property below, it worked, with each map task's output becoming a final output file. I'm not really sure if this is the correct way to do it, but I'll take it as a workaround for now:
conf.set("mapred.reduce.tasks", "0")
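Setting the reducer count to zero makes this a map-only job: each map task writes its output directly, so you get one part-m-nnnnn file per mapper. With the newer mapreduce API the same setting can be expressed programmatically (a sketch, equivalent to the property above):

```java
// Map-only job: mappers write part-m-nnnnn files directly, no shuffle or reduce
job.setNumReduceTasks(0);
```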
Upvotes: -1
Reputation: 5239
Forgive me, but typically you get as many part-r-nnnnn files as you have reducer tasks. If the word count example has only one reducer configured, all you have to do is configure more than one (mapred.reduce.tasks, or mapreduce.job.reduces in Hadoop 2).
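With the mapreduce API this can also be set in the driver (a sketch; 4 is an arbitrary example count):

```java
// Four reducers produce part-r-00000 through part-r-00003
// (some may be empty if no keys hash to that partition)
job.setNumReduceTasks(4);
```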
Upvotes: 1