David Williams

Reputation: 8654

Hadoop - How to Collect Text Output Without Values

I am working on a map reduce job, and I am wondering if it is possible to emit a custom string to my output file. No counts, no other quantities, just a blob of text.

Here's the basic idea of what I'm thinking about:

public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        // this map doesn't do very much
        String line = value.toString();
        word.set(line);
        // emit to map output
        output.collect(word,one);

        // but how do I do something like output.collect(word)?
        // In my output file I want to control the text;
        // this is intended to be a map-only job
    }
}

Is this kind of thing possible? The goal is a map-only job that transforms data, using Hadoop for its parallelism but not necessarily the whole MR framework. When I run this job I get an output file in HDFS for each mapper.

$ hadoop fs -ls /Users/dwilliams/output
2013-09-15 09:54:23.875 java[3902:1703] Unable to load realm info from SCDynamicStore
Found 12 items
-rw-r--r--   1 dwilliams supergroup          0 2013-09-15 09:52 /Users/dwilliams/output/_SUCCESS
drwxr-xr-x   - dwilliams supergroup          0 2013-09-15 09:52 /Users/dwilliams/output/_logs
-rw-r--r--   1 dwilliams supergroup    7223469 2013-09-15 09:52 /Users/dwilliams/output/part-00000
-rw-r--r--   1 dwilliams supergroup    7225393 2013-09-15 09:52 /Users/dwilliams/output/part-00001
-rw-r--r--   1 dwilliams supergroup    7223560 2013-09-15 09:52 /Users/dwilliams/output/part-00002
-rw-r--r--   1 dwilliams supergroup    7222830 2013-09-15 09:52 /Users/dwilliams/output/part-00003
-rw-r--r--   1 dwilliams supergroup    7224602 2013-09-15 09:52 /Users/dwilliams/output/part-00004
-rw-r--r--   1 dwilliams supergroup    7225045 2013-09-15 09:52 /Users/dwilliams/output/part-00005
-rw-r--r--   1 dwilliams supergroup    7222759 2013-09-15 09:52 /Users/dwilliams/output/part-00006
-rw-r--r--   1 dwilliams supergroup    7223617 2013-09-15 09:52 /Users/dwilliams/output/part-00007
-rw-r--r--   1 dwilliams supergroup    7223181 2013-09-15 09:52 /Users/dwilliams/output/part-00008
-rw-r--r--   1 dwilliams supergroup    7223078 2013-09-15 09:52 /Users/dwilliams/output/part-00009

How do I get the results in 1 file? Should I use the identity reducer?

Upvotes: 2

Views: 3345

Answers (2)

Tariq

Reputation: 34184

1. To achieve output.collect(word) you can make use of the NullWritable class: call output.collect(word, NullWritable.get()) in your Mapper. Note that NullWritable is a singleton.

2. If you do not want multiple output files, you can set the number of reducers to 1. But this incurs additional overhead, since it involves a lot of data shuffling over the network: the single Reducer has to fetch its input from the n different machines where the Mappers ran, and all of the load lands on one machine. Still, if you want just one output file, one Reducer is the way to go; conf.setNumReduceTasks(1) should be sufficient to achieve that.
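Putting both points together, a minimal old-API sketch might look like the following (class and job names are illustrative, not from the question; it assumes the default identity reducer funnels everything into one part file):

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class TextPassThrough {

    // Emits each input line as the key with a NullWritable value,
    // so only the text itself ends up in the output file.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, NullWritable> {
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, NullWritable> output,
                        Reporter reporter) throws IOException {
            word.set(value.toString());
            output.collect(word, NullWritable.get());
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(TextPassThrough.class);
        conf.setJobName("text-pass-through");
        conf.setMapperClass(Map.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(NullWritable.class);
        // One reducer (the default identity reducer) produces a single part file.
        conf.setNumReduceTasks(1);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
```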

A couple of small suggestions:

  • I would not suggest using getmerge, as it copies the resulting file onto the local FS; you then have to copy it back to HDFS in order to use it further.
  • Use the new API if possible.

Upvotes: 4

Hari Menon

Reputation: 35405

If it is a map-only job, the number of output files will equal the number of mappers; if reducers are used, it will equal the number of reducers. But you can always run hadoop dfs -getmerge <hdfs output directory> <some file> to merge all the outputs in the output directory into one local file.

You can output plain text files using TextOutputFormat, e.g. conf.setOutputFormat(TextOutputFormat.class). Then change the map method above to use OutputCollector<NullWritable, Text> and call output.collect(NullWritable.get(), value); TextOutputFormat suppresses a NullWritable key, so only the text is written for each record. If you want tab-separated key-value pairs instead, change it to OutputCollector<Text, Text> and use output.collect(key, value), which prints key<tab>value in the output.
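A sketch of that change to the question's mapper (a map-only variant; the class name is illustrative):

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// NullWritable as the key: TextOutputFormat omits it, so each
// output line contains only the value's text, with no key or tab.
public class TextOnlyMap extends MapReduceBase
        implements Mapper<LongWritable, Text, NullWritable, Text> {

    public void map(LongWritable key, Text value,
                    OutputCollector<NullWritable, Text> output,
                    Reporter reporter) throws IOException {
        // Pass the line through unchanged; any transformation of
        // the text would go here.
        output.collect(NullWritable.get(), value);
    }
}
```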

Upvotes: 0
