JoJo

Reputation: 1447

Can a Hadoop mapper send part of its data to the reducer and write the remaining data to HDFS directly?

As the title says, I have a question about a map-reduce task design:

After some thought, I realized that only part of the data (maybe 10%) needs to be sent to the reducer; the rest can be written to HDFS directly. At the end, I would just combine the two outputs, from the mapper and from the reducer (I must end up with a unified file or directory for the total data). I think this could reduce the bandwidth cost of running the task.

So can this idea be implemented? (I know how to write to HDFS directly from the mapper, but this requires the mapper to both write to HDFS and send data to the reducer.)

Upvotes: 1

Views: 600

Answers (1)

SSaikia_JtheRocker

Reputation: 5063

One solution would be to use MultipleOutputs's write() method for 90% of the data, and the normal context.write() from the mapper for the remaining 10%, so that only that part goes to the reducer.

This method of MultipleOutputs can be used -

void write(K key, V value, String baseOutputPath);
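The first solution can be sketched roughly as below. This is a minimal illustration, not a drop-in implementation: the Text key/value types, the output path "direct/part", and the shouldReduce() predicate (which decides which ~10% of records need reduction) are all hypothetical placeholders for your own logic.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class SplittingMapper extends Mapper<LongWritable, Text, Text, Text> {

    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (shouldReduce(value)) {
            // ~10% of the records: sent through the shuffle to the reducer
            context.write(new Text("reduce"), value);
        } else {
            // ~90% of the records: written straight to HDFS under baseOutputPath
            mos.write(new Text("direct"), value, "direct/part");
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        mos.close(); // flush and close the side output files
    }

    // Hypothetical predicate: decides which records need to be reduced
    private boolean shouldReduce(Text value) {
        return value.toString().startsWith("R");
    }
}
```

Note that mos.close() in cleanup() is important; without it the side files may be left unflushed.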

A second solution would be to use FileSystem (from the Hadoop API) directly in the mapper to write 90% of the data to HDFS. But I don't know how efficient it would be if you are running a lot of mappers. The same applies to MultipleOutputs above -

Something like:

In the setup() function of the mapper, do this -

FileSystem fs = FileSystem.get(context.getConfiguration());
FSDataOutputStream fileOut = fs.create(new Path("your_hdfs_filename"));

Then, inside the map() function - create() returns an FSDataOutputStream object; use its write() method to write records to the file.

Close the output stream in the cleanup() function after you are done. Something like - fileOut.close(); (closing the FileSystem object itself is usually unnecessary, since Hadoop caches and shares FileSystem instances).
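Putting the setup/map/cleanup steps above together, a sketch of the second solution could look like this. The path prefix and record format are assumptions; one detail worth noting is that each map task must write to its own file (here the task attempt ID is appended to the name), otherwise concurrent mappers would collide on the same HDFS path.

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DirectWriteMapper extends Mapper<LongWritable, Text, Text, Text> {

    private FSDataOutputStream fileOut;

    @Override
    protected void setup(Context context) throws IOException {
        FileSystem fs = FileSystem.get(context.getConfiguration());
        // One file per map task, named after the task attempt to avoid collisions
        Path out = new Path("direct_output/" + context.getTaskAttemptID().toString());
        fileOut = fs.create(out);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (shouldReduce(value)) {
            // ~10%: goes through the shuffle to the reducer
            context.write(new Text("reduce"), value);
        } else {
            // ~90%: written straight to HDFS, one record per line
            fileOut.write(value.toString().getBytes("UTF-8"));
            fileOut.write('\n');
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        fileOut.close(); // flush the direct output; the cached FileSystem stays open
    }

    // Hypothetical predicate: decides which records need to be reduced
    private boolean shouldReduce(Text value) {
        return value.toString().startsWith("R");
    }
}
```

Compared to MultipleOutputs, this gives you full control over the file format, but you lose the output-committer behavior (speculative or failed task attempts may leave partial files behind that you would have to clean up yourself).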

Upvotes: 1
