Reputation: 8654
I am working on a MapReduce job, and I am wondering if it is possible to emit a custom string to my output file. No counts, no other quantities, just a blob of text.
Here's the basic idea of what I'm thinking about:
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        // this map doesn't do very much
        String line = value.toString();
        word.set(line);
        // emit to map output
        output.collect(word, one);
        // but how do I do something like output.collect(word)?
        // because in my output file I want to control the text
        // this is intended to be a map-only job
    }
}
Is this kind of thing possible? This is to create a map-only job to transform data, using Hadoop for its parallelism but not necessarily the whole MapReduce framework. When I run this job I get one output file in HDFS per mapper.
$ hadoop fs -ls /Users/dwilliams/output
2013-09-15 09:54:23.875 java[3902:1703] Unable to load realm info from SCDynamicStore
Found 12 items
-rw-r--r-- 1 dwilliams supergroup 0 2013-09-15 09:52 /Users/dwilliams/output/_SUCCESS
drwxr-xr-x - dwilliams supergroup 0 2013-09-15 09:52 /Users/dwilliams/output/_logs
-rw-r--r-- 1 dwilliams supergroup 7223469 2013-09-15 09:52 /Users/dwilliams/output/part-00000
-rw-r--r-- 1 dwilliams supergroup 7225393 2013-09-15 09:52 /Users/dwilliams/output/part-00001
-rw-r--r-- 1 dwilliams supergroup 7223560 2013-09-15 09:52 /Users/dwilliams/output/part-00002
-rw-r--r-- 1 dwilliams supergroup 7222830 2013-09-15 09:52 /Users/dwilliams/output/part-00003
-rw-r--r-- 1 dwilliams supergroup 7224602 2013-09-15 09:52 /Users/dwilliams/output/part-00004
-rw-r--r-- 1 dwilliams supergroup 7225045 2013-09-15 09:52 /Users/dwilliams/output/part-00005
-rw-r--r-- 1 dwilliams supergroup 7222759 2013-09-15 09:52 /Users/dwilliams/output/part-00006
-rw-r--r-- 1 dwilliams supergroup 7223617 2013-09-15 09:52 /Users/dwilliams/output/part-00007
-rw-r--r-- 1 dwilliams supergroup 7223181 2013-09-15 09:52 /Users/dwilliams/output/part-00008
-rw-r--r-- 1 dwilliams supergroup 7223078 2013-09-15 09:52 /Users/dwilliams/output/part-00009
How do I get the results in 1 file? Should I use the identity reducer?
Upvotes: 2
Views: 3345
Reputation: 34184
A couple of small suggestions:
1. To achieve output.collect(word) you can make use of the class NullWritable. To do that, use output.collect(word, NullWritable.get()) in your Mapper (see the sketch below). Note that NullWritable is a singleton, which is why you fetch it with get() rather than constructing one.
2. If you do not want multiple output files, you can set the number of reducers to 1. This incurs additional overhead, though, since it involves a lot of data shuffling over the network: the single Reducer has to fetch its input from the n different machines where the Mappers ran, and all of that load lands on one machine. But if you want just one output file, one Reducer will do it, and conf.setNumReduceTasks(1) should be sufficient to achieve that (also shown in the sketch below).
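Here is a minimal sketch wiring both suggestions together, against the old org.apache.hadoop.mapred API used in the question. The class name TextOnlyJob and the job setup are illustrative, not from the answer:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class TextOnlyJob {

    // Suggestion 1: pair each line with NullWritable so only the text
    // itself ends up in the output file.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, NullWritable> {

        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, NullWritable> output,
                        Reporter reporter) throws IOException {
            word.set(value.toString());
            // NullWritable is a singleton: always use NullWritable.get()
            output.collect(word, NullWritable.get());
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(TextOnlyJob.class);
        conf.setJobName("text-only");

        conf.setMapperClass(Map.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(NullWritable.class);

        // Suggestion 2: one (identity) reducer funnels everything into a
        // single part-00000 file, at the cost of a network shuffle.
        conf.setNumReduceTasks(1);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}

Note that routing everything through a single reducer also sorts the output lines, since MapReduce sorts keys before the reduce phase.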
Upvotes: 4
Reputation: 35405
If it is a map-only job, the number of output files will be equal to the number of mappers; if reducers are involved, it will be equal to the number of reducers. But you can always run hadoop dfs -getmerge <hdfs output directory> <some file>
to merge all the files in the output directory into one local file.
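For example, with the output directory from the question (merged.txt here is just an arbitrary local file name):
$ hadoop dfs -getmerge /Users/dwilliams/output merged.txt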
You can output plain text files using TextOutputFormat, e.g. job.setOutputFormat(TextOutputFormat.class). Then change the map method above to use OutputCollector<NullWritable, Text> and call output.collect(NullWritable.get(), new Text("some text")). This writes some text for every record. If you want tab-separated key-value pairs instead, change it to OutputCollector<Text, Text> and call output.collect(new Text("key"), new Text("some text")). This prints key<tab>some text in the output.
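A minimal sketch of that revised mapper, again against the old org.apache.hadoop.mapred API; it simply echoes each input line, so adapt the text it emits as needed:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, NullWritable, Text> {

    private final Text text = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<NullWritable, Text> output,
                    Reporter reporter) throws IOException {
        // Emit whatever text you want in the output file; TextOutputFormat
        // skips a NullWritable key and writes only the value.
        text.set(value.toString());
        output.collect(NullWritable.get(), text);
    }
}

In the driver, remember to set conf.setOutputKeyClass(NullWritable.class) and conf.setOutputValueClass(Text.class) to match the collector's types.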
Upvotes: 0