Helin Wang

Reputation: 4202

how to output to HDFS from mapper directly?

Under certain criteria we want the mapper to do all the work and write its output to HDFS directly; we don't want the data transmitted to the reducer (that would use extra bandwidth; please correct me if there is a case where this is wrong).

a pseudo code would be:

def mapper(k, v_list):
    for v in v_list:
        if criteria(k, v):
            write_to_hdfs(k, v)
        else:
            emit(k, v)

I found it hard because the only thing we can play with is OutputCollector. One thing I can think of is to extend OutputCollector, override OutputCollector.collect and do the work there. Are there any better ways?
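The pseudo code above, written out as a runnable sketch (the criteria function and the record format are placeholders; the two lists stand in for the HDFS sink and the reducer output):

```python
# Sketch of the mapper logic: split records into those that would be
# written straight to HDFS and those that would be emitted to the reducer.
def split_records(records, criteria):
    to_hdfs, to_reducer = [], []
    for key, value in records:
        if criteria(key, value):
            to_hdfs.append((key, value))      # would be written to HDFS directly
        else:
            to_reducer.append((key, value))   # would be emitted to the reducer
    return to_hdfs, to_reducer

records = [("a", 1), ("b", 5), ("c", 2)]
hdfs, reducer = split_records(records, lambda k, v: v > 3)
```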

Upvotes: 4

Views: 5318

Answers (4)

RobertoP

Reputation: 637

You can just set the number of reduce tasks to 0 by using JobConf.setNumReduceTasks(0). This will make the results of the mapper go straight into HDFS.

From the Map-Reduce manual: http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html

Reducer NONE
It is legal to set the number of reduce-tasks to zero if no reduction is desired.

In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by setOutputPath(Path). The framework does not sort the map-outputs before writing them out to the FileSystem.
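For a streaming job the equivalent is to set the reducer count to zero on the command line; with zero reducers each mapper writes a part-m-NNNNN file directly into the output path. A sketch (the jar path, input/output paths and mapper script are placeholders):

```shell
# Map-only streaming job: zero reducers, mapper output goes straight to HDFS.
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -D mapreduce.job.reduces=0 \
    -input /data/input \
    -output /data/output \
    -mapper mapper.py \
    -file mapper.py
```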

Upvotes: 3

naveenkumarbv

Reputation: 117

We can in fact write output to HDFS and pass data on to the reducer at the same time. I understand that you are using Hadoop Streaming; I've implemented something similar using Java MapReduce.

We can generate named output files from a Mapper or Reducer using MultipleOutputs. In your Mapper implementation, after all the business logic for processing the input data, you can write output via multipleOutputs.write("NamedOutputFileName", outputKey, outputValue), and for the data you want to pass on to the reducer you can write to the context using context.write(outputKey, outputValue).

I think if you can find a way to write data from the mapper to a named output file in the language you are using (e.g. Python), this will definitely work.
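A rough streaming analogue of that idea in Python (the threshold, the tab-separated record layout and the side-file name are assumptions; the side file would still need to be copied into HDFS afterwards, e.g. with hadoop fs -put):

```python
def split_stream(lines, threshold=3):
    """Split tab-separated "key<TAB>value" lines: values above the threshold
    go to a side output (destined for HDFS), the rest pass through to the
    reducer via stdout."""
    side, emitted = [], []
    for line in lines:
        key, value = line.rstrip("\n").split("\t")
        if int(value) > threshold:
            side.append(line)
        else:
            emitted.append(line)
    return side, emitted

# In a real streaming mapper this would look something like:
#   side, emitted = split_stream(sys.stdin)
#   open("side_output.txt", "w").writelines(side)   # later: hadoop fs -put
#   sys.stdout.writelines(emitted)
side, emitted = split_stream(["a\t1\n", "b\t5\n", "c\t2\n"])
```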

I hope this helps.

Upvotes: 0

Jay R.

Reputation: 32181

Not sending something to the reducer may not actually save bandwidth if you are still going to write it to HDFS. HDFS replicates blocks to other nodes, so that replication traffic happens either way.

There are other good reasons to write output from the mapper, though. There is a FAQ about this, but it is a little short on details beyond saying that you can do it.

I found another question which is potentially a duplicate of yours here. That question has answers that are more helpful if you are writing a Mapper in Java. If you are trying to do this from a streaming job, you can just use the hadoop fs commands in scripts to do it.
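From a script that could look something like the following fragment (the local file name and HDFS paths are placeholders; the script would first write the matching records to a local file, then copy it up):

```shell
# Copy a locally written side file into HDFS with the fs shell;
# $$ gives each mapper task its own file name.
hadoop fs -mkdir -p /data/side-output
hadoop fs -put side.txt /data/side-output/side-$$.txt
```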

Upvotes: 0

Chris White

Reputation: 30089

I'm assuming that you're using streaming, in which case there is no standard way of doing this.

It's certainly possible in a Java Mapper. For streaming you'd need to amend the PipeMapper java file, or, as you say, write your own output collector; but if you're going to that much trouble, you might as well just write a Java mapper.

Upvotes: 1
