David Williams

Reputation: 8654

Hadoop - How to Collect Text Output Without Values

I am working on a map reduce job, and I am wondering if it is possible to emit a custom string to my output file. No counts, no other quantities, just a blob of text.

Here's the basic idea of what I'm thinking about:

public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        // this map doesn't do very much
        String line = value.toString();
        word.set(line);
        // emit to map output
        output.collect(word,one);

        // but how do I do something like output.collect(word)?
        // In my output file I want to control the text;
        // this is intended to be a map-only job
    }
}

Is this kind of thing possible? The goal is a map-only job that transforms data, using Hadoop for its parallelism but not necessarily the whole MR framework. When I run this job I get an output file in HDFS for each mapper.

$ hadoop fs -ls /Users/dwilliams/output
2013-09-15 09:54:23.875 java[3902:1703] Unable to load realm info from SCDynamicStore
Found 12 items
-rw-r--r--   1 dwilliams supergroup          0 2013-09-15 09:52 /Users/dwilliams/output/_SUCCESS
drwxr-xr-x   - dwilliams supergroup          0 2013-09-15 09:52 /Users/dwilliams/output/_logs
-rw-r--r--   1 dwilliams supergroup    7223469 2013-09-15 09:52 /Users/dwilliams/output/part-00000
-rw-r--r--   1 dwilliams supergroup    7225393 2013-09-15 09:52 /Users/dwilliams/output/part-00001
-rw-r--r--   1 dwilliams supergroup    7223560 2013-09-15 09:52 /Users/dwilliams/output/part-00002
-rw-r--r--   1 dwilliams supergroup    7222830 2013-09-15 09:52 /Users/dwilliams/output/part-00003
-rw-r--r--   1 dwilliams supergroup    7224602 2013-09-15 09:52 /Users/dwilliams/output/part-00004
-rw-r--r--   1 dwilliams supergroup    7225045 2013-09-15 09:52 /Users/dwilliams/output/part-00005
-rw-r--r--   1 dwilliams supergroup    7222759 2013-09-15 09:52 /Users/dwilliams/output/part-00006
-rw-r--r--   1 dwilliams supergroup    7223617 2013-09-15 09:52 /Users/dwilliams/output/part-00007
-rw-r--r--   1 dwilliams supergroup    7223181 2013-09-15 09:52 /Users/dwilliams/output/part-00008
-rw-r--r--   1 dwilliams supergroup    7223078 2013-09-15 09:52 /Users/dwilliams/output/part-00009

How do I get the results in 1 file? Should I use the identity reducer?

Upvotes: 2

Views: 3345

Answers (2)

Tariq

Reputation: 34184

1. To achieve output.collect(word) you can make use of the NullWritable class: call output.collect(word, NullWritable.get()) in your Mapper. Note that NullWritable is a singleton.

2. If you do not want multiple output files, you can set the number of reducers to 1. But this incurs additional overhead, since it involves a lot of data shuffling over the network: the single Reducer has to fetch its input from the n different machines where the Mappers ran, and all of the load lands on one machine. Still, if you want just one output file, one Reducer is the way to go; conf.setNumReduceTasks(1) should be sufficient to achieve that.
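Putting both points together, a minimal old-API sketch might look like the following (class and job names are illustrative, not from the question; it assumes the default identity reducer funnels everything into one part file):

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class TextPassThrough {

    // Emits each input line as the key with a NullWritable value,
    // so only the text itself ends up in the output file.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, NullWritable> {
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, NullWritable> output,
                        Reporter reporter) throws IOException {
            word.set(value.toString());
            output.collect(word, NullWritable.get());
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(TextPassThrough.class);
        conf.setJobName("text-pass-through");
        conf.setMapperClass(Map.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(NullWritable.class);
        // One reducer (the default identity reducer) produces a single part file.
        conf.setNumReduceTasks(1);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
```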

A couple of small suggestions:

  • I would not suggest using getmerge, as it copies the resulting file onto the local FS; you then have to copy it back to HDFS in order to use it further.
  • Use the new API if possible.

Upvotes: 4

Hari Menon

Reputation: 35405

If it is a map-only job, the number of output files will equal the number of mappers; if reducers are used, it will equal the number of reducers. But you can always run hadoop dfs -getmerge <hdfs output directory> <some file> to merge all the outputs in the output directory into one local file.

You can output plain text files using TextOutputFormat, e.g. conf.setOutputFormat(TextOutputFormat.class). Then change the map method above to use OutputCollector<NullWritable, Text> and call output.collect(NullWritable.get(), value); TextOutputFormat suppresses a NullWritable key, so only the text is written for each record. If you want tab-separated key-value pairs instead, change it to OutputCollector<Text, Text> and use output.collect(key, value), which prints key<tab>value in the output.
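A sketch of that change to the question's mapper (a map-only variant; the class name is illustrative):

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// NullWritable as the key: TextOutputFormat omits it, so each
// output line contains only the value's text, with no key or tab.
public class TextOnlyMap extends MapReduceBase
        implements Mapper<LongWritable, Text, NullWritable, Text> {

    public void map(LongWritable key, Text value,
                    OutputCollector<NullWritable, Text> output,
                    Reporter reporter) throws IOException {
        // Pass the line through unchanged; any transformation of
        // the text would go here.
        output.collect(NullWritable.get(), value);
    }
}
```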

Upvotes: 0
