Reputation: 67
For example, a typical WordCount MapReduce job might produce output that reads:
hello 3
world 4
again 1
I want to format the output slightly differently so that it shows this instead:
3 hello
4 world
1 again
I've read a lot of posts about sorting by the value, and the answers suggest running a second MapReduce job on the output of the first one. However, I don't need to sort by the value, and multiple keys may have the same value, so I don't want them lumped together.
Is there an easy way to simply switch the order the key/values are printed? It seems like it should be simple.
Upvotes: 0
Views: 1181
Reputation: 13927
Two options to consider, in order of ease, are:
Switch the Key/Value in the Reduce
Modify the output from the reduce to switch the key and value. For example, the reducer in Hadoop's example WordCount job would change to:
public static class IntSumReducer
        extends Reducer<Text, IntWritable, IntWritable, Text> { // output key/value types swapped

    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        // Emit the count as the key and the word as the value
        context.write(result, key);
    }
}
Here the context.write(result, key); call has been changed to switch the key and value (and the output key/value types in the Reducer's class signature are swapped to IntWritable and Text to match).
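Because the reducer now emits (IntWritable, Text) pairs while the mapper still emits (Text, IntWritable), the driver's type declarations have to agree. A minimal sketch of the relevant lines, assuming the standard WordCount driver from the Hadoop examples:
// Map output types now differ from the final output types, so declare them explicitly
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
// Final output types match what the modified IntSumReducer writes
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(Text.class);
// Note: the stock WordCount driver also sets IntSumReducer as the combiner;
// remove that call, since a combiner must still emit (Text, IntWritable)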
Use a second Map-only job
You can use the InverseMapper provided by Hadoop to run a Map-only (0 reducers) job to switch the key and value. So you would just have a second job and only need to write the driver, which would look something like:
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "Switch inputs");
    job.setJarByClass(WordCount.class);

    // InverseMapper swaps each (key, value) pair to (value, key)
    job.setMapperClass(InverseMapper.class);
    // Map-only job: no reducers
    job.setNumReduceTasks(0);

    // Types after inversion: key = IntWritable (count), value = Text (word)
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(Text.class);

    // Read the first job's output, which should be written as a SequenceFile
    job.setInputFormatClass(SequenceFileInputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Note that you would want the first job to write its output using SequenceFileOutputFormat, and then use SequenceFileInputFormat as the input format for the second job.
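For the first job, that just means setting the output format in its driver. A minimal sketch, assuming the standard WordCount driver:
// In the first (WordCount) job's driver: write SequenceFile output
// so the second job can read it with SequenceFileInputFormat
job.setOutputFormatClass(SequenceFileOutputFormat.class);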
Upvotes: 1