SNSI

Reputation: 11

Mapper with MultipleInputs on a Hadoop cluster

I have to implement two MapReduce jobs where a mapper in phase II (Mapper_2) needs the output of the reducer in phase I (Reducer_1).

Mapper_2 also needs another input: a large text file (2 TB).

I have written the following, but my question is: the text input will be split among the nodes in the cluster, but what about the output of Reducer_1? I want each mapper in phase II to see the whole of Reducer_1's output.

MultipleInputs.addInputPath(job, textInputPath, SomeInputFormat.class, Mapper_2.class);
MultipleInputs.addInputPath(job, reducer_1OutputPath, SomeInputFormat.class, Mapper_2.class);

Upvotes: 0

Views: 87

Answers (1)

milk3422

Reputation: 660

Your use of MultipleInputs seems fine. I would look at using the distributed cache to share the output of reducer_1 with mapper_2.

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

JobConf job = new JobConf();
DistributedCache.addCacheFile(new URI("/path/to/reducer_1/output"), job);

Also, when using the distributed cache, remember to read the cache file in the setup() method of mapper_2.

setup() runs once for each mapper before map() is called, and cleanup() runs once for each mapper after the last call to map().
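
For example, here is a minimal sketch of what mapper_2 could look like. The type parameters, the reducer1Output map, and the assumption that reducer_1 emitted tab-separated key/value pairs are all illustrative, not from your question:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Mapper_2 extends Mapper<LongWritable, Text, Text, Text> {

    // Holds the whole of reducer_1's output, loaded once per mapper
    private final Map<String, String> reducer1Output = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Local copies of the files registered via DistributedCache.addCacheFile()
        Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        if (cacheFiles == null) {
            return;
        }
        for (Path cacheFile : cacheFiles) {
            BufferedReader reader = new BufferedReader(new FileReader(cacheFile.toString()));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    // Assumes reducer_1 wrote tab-separated key/value pairs
                    String[] parts = line.split("\t", 2);
                    if (parts.length == 2) {
                        reducer1Output.put(parts[0], parts[1]);
                    }
                }
            } finally {
                reader.close();
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... join each record of the 2 TB text input against reducer1Output ...
    }
}

Note that every mapper loads the full cached file into memory, so this only works if reducer_1's output is small relative to the 2 TB text input.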

Upvotes: 1
