Reputation: 11
I have to implement two MapReduce jobs, where a mapper in phase II (Mapper_2) needs the output of the reducer in phase I (Reducer_1) as one of its inputs.
Mapper_2 also takes another input: a big text file (2 TB).
I have written the code below, but my question is: the text input will be split among the nodes in the cluster, but what about the output of Reducer_1? I want each mapper in phase II to see the whole of Reducer_1's output.
MultipleInputs.addInputPath(job, textInputPath, SomeInputFormat.class, Mapper_2.class);
MultipleInputs.addInputPath(job, reducer_1OutputPath, SomeInputFormat.class, Mapper_2.class);
Upvotes: 0
Views: 87
Reputation: 660
Your use of multiple inputs seems fine, but it won't give you what you want: each mapper would only receive a split of Reducer_1's output, not all of it. Instead, look at using the distributed cache to share the output of Reducer_1 with every Mapper_2 instance.
JobConf job = new JobConf();
DistributedCache.addCacheFile(new URI("/path/to/reducer_1/output"), job);
Also, when using the distributed cache, remember to read the cache file in the setup method of Mapper_2.
setup() runs once per mapper before map() is first called, and cleanup() runs once per mapper after the last call to map().
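To make that concrete, here is a minimal sketch of what Mapper_2 might look like with the newer mapreduce API, where the cached Reducer_1 output is loaded into memory in setup() and is then visible to every map() call. The class layout, the tab-separated key/value format, and the field names are my assumptions, not from the question.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Mapper_2 extends Mapper<LongWritable, Text, Text, Text> {

    // Holds the whole Reducer_1 output; populated once per mapper in setup()
    private final Map<String, String> reducerOutput = new HashMap<>();

    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        // Files registered with the distributed cache are copied to the
        // local disk of every node before the mapper's tasks start.
        URI[] cacheFiles = context.getCacheFiles();
        if (cacheFiles == null) {
            return;
        }
        for (URI cacheFile : cacheFiles) {
            // The cached file is available in the task's working
            // directory under its file name.
            String localName = new Path(cacheFile.getPath()).getName();
            try (BufferedReader reader =
                    new BufferedReader(new FileReader(localName))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // Assumes Reducer_1 wrote key<TAB>value lines
                    String[] parts = line.split("\t", 2);
                    if (parts.length == 2) {
                        reducerOutput.put(parts[0], parts[1]);
                    }
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... process the 2 TB text input split here, with the complete
        // Reducer_1 output available in reducerOutput ...
    }
}
```

Since the whole file is held in a HashMap, this only works if Reducer_1's output fits in each mapper's heap; if it does not, a map-side join with the distributed cache is not an option and you would need a reduce-side join instead.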
Upvotes: 1