ctkim

Reputation: 67

How do I return the output of a Hadoop MapReduce job as value/key instead of key/value?

For example, the typical WordCount MapReduce job might produce output like this:

hello 3
world 4
again 1

I want to format the output slightly differently so that it would show this instead:

3 hello
4 world
1 again

I've read many posts about sorting by value, and the answers suggest running a second MapReduce job on the output of the first. However, I don't need to sort by value, and multiple keys may share the same value. I don't want them lumped together.

Is there an easy way to simply switch the order in which the key and value are printed? It seems like it should be simple.

Upvotes: 0

Views: 1181

Answers (1)

Binary Nerd

Reputation: 13927

Two options to consider in order of ease are:

Switch the Key/Value in the Reduce

Modify the output of the reducer to switch the key and value. For example, the reducer in Hadoop's example WordCount job would change to:

// Output types are swapped: the key is now the count (IntWritable), the value is the word (Text).
public static class IntSumReducer extends Reducer<Text,IntWritable,IntWritable,Text> {
    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        // Write the count first and the word second.
        context.write(result, key);
    }
}

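Here context.write(result, key); writes the count first and the word second. Because the output types are swapped, the last two type parameters of the reducer must be IntWritable, Text, and the driver must call job.setOutputKeyClass(IntWritable.class) and job.setOutputValueClass(Text.class). The map output is still Text/IntWritable, so also declare it with job.setMapOutputKeyClass(Text.class) and job.setMapOutputValueClass(IntWritable.class). This reducer can no longer double as the combiner, because a combiner must emit the same key/value types it receives.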
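Swapping at write time does not merge words that share a count, because grouping still happens on the word. The plain-Java sketch below mimics that behaviour without any Hadoop dependency. The class and method names are made up for illustration, and a TreeMap stands in for Hadoop's sort-by-key, which is an approximation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SwapSketch {
    // Mimics the reduce phase: group by word, sum the counts,
    // then emit one line per distinct word as "count<TAB>word".
    static List<String> countAndSwap(List<String> words) {
        // TreeMap keeps words sorted, roughly like Hadoop's key ordering.
        Map<String, Integer> counts = new TreeMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            // Only the print order is swapped; grouping was done on the word.
            lines.add(e.getValue() + "\t" + e.getKey());
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> input = List.of("hello", "world", "hello", "again", "world");
        for (String line : countAndSwap(input)) {
            System.out.println(line);
        }
    }
}
```

In this sample, "hello" and "world" both have a count of 2 and still appear on separate lines.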

Use a second Map only job

You can use the InverseMapper provided by Hadoop to run a map-only (0 reducers) job that switches the key and value. You would then have a second job and only need to write its driver, which would look something like:

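import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.InverseMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SwitchKeyValue { // class name is a placeholder
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Switch inputs");
        job.setJarByClass(SwitchKeyValue.class);
        job.setMapperClass(InverseMapper.class); // emits (value, key) for each (key, value)
        job.setNumReduceTasks(0);                // map-only job
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}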

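Note that the first job should write its output with SequenceFileOutputFormat so that the second job can read it with SequenceFileInputFormat. A sequence file preserves the key and value as Text and IntWritable, which is what InverseMapper needs in order to emit IntWritable/Text pairs.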
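As a rough sketch of that change on the first job's side, the output portion of the WordCount driver might be configured as follows. The class and method names here are hypothetical, and the rest of the WordCount driver (mapper, reducer, input path) is assumed to stay as it is.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class WordCountOutputConfig { // hypothetical helper class
    // Configures the first (WordCount) job to write a sequence file
    // of (Text word, IntWritable count) records.
    static void configureOutput(Job job, Path outputDir) {
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Binary output that keeps the Writable types intact for the second job.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setOutputPath(job, outputDir);
    }
}
```

The directory passed as outputDir would then be given as args[0] to the second, map-only job.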

Upvotes: 1
