Reputation: 795
I have chained MapReduce jobs in the following way: map1 -> reduce1 -> map2 -> reduce2. During the map1 step I calculate, as a side effect, data that will be needed only in the reduce2 step, so I don't want to pass it all the way through the chain. What is the best way to pass this data so that the reduce2 step can get data from both map2 and map1?
Thanks
Upvotes: 1
Views: 1464
Reputation: 7462
Based on your comments, you output A and B from mapper 1. Then, you want A to go to reducer 1 and B to go to reducer 2, along with the output of mapper 2. The best option I can see is the following:
JOB 1:
To differentiate A from B, use MultipleOutputs in the first job. Use a common prefix (e.g. in the values) for the type B intermediate output of mapper 1, so that it can be distinguished from type A output. In reducer 1, when you see the prefix, remove it and write the B records to the B output path.
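The prefix convention for JOB 1 can be sketched in plain Java, without any Hadoop classes. This is only an illustration of the tagging logic: the class name, the helper names, and the "B#" marker are hypothetical, and in a real job the two lists would be the regular output and the named B output written through MultipleOutputs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of the prefix convention; no Hadoop classes are used.
class PrefixTaggingSketch {
    // Hypothetical marker; pick one that can never start a genuine type A value.
    static final String B_PREFIX = "B#";

    // Mapper 1 side: mark a side-effect (type B) value before emitting it.
    static String tagB(String value) {
        return B_PREFIX + value;
    }

    // Reducer 1 side: true when the value carries the type B marker.
    static boolean isB(String value) {
        return value.startsWith(B_PREFIX);
    }

    // Reducer 1 side: strip the marker before writing the value to the B output.
    static String untagB(String value) {
        return value.substring(B_PREFIX.length());
    }

    public static void main(String[] args) {
        // Values as reducer 1 might see them: A values plain, B values tagged.
        List<String> reducerInput = Arrays.asList("a1", tagB("b1"), "a2", tagB("b2"));
        List<String> outputA = new ArrayList<String>(); // stands in for the regular output of job 1
        List<String> outputB = new ArrayList<String>(); // stands in for the B output path
        for (String value : reducerInput) {
            if (isB(value)) {
                outputB.add(untagB(value));
            } else {
                outputA.add(value);
            }
        }
        System.out.println("A: " + outputA);
        System.out.println("B: " + outputB);
    }
}
```

The marker only works if no type A value can ever begin with it, so it is worth choosing something that cannot occur in the real data.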
JOB 2:
Use MultipleInputs in your second job. Use mapper 2 for the input that it processes and an IdentityMapper for B. This will simply forward B to reducer 2, where you will also process the output of mapper 2.
A simple code snippet:
MultipleInputs.addInputPath(conf, new Path("/input/path/of/job/2"), SequenceFileInputFormat.class, Mapper2.class);
MultipleInputs.addInputPath(conf, new Path("/path/of/B"), SequenceFileInputFormat.class, IdentityMapper.class);
conf.setReducerClass(Reducer2.class);
where MultipleInputs is imported with import org.apache.hadoop.mapred.lib.MultipleInputs;.
Reducer 2 cannot receive B and process it the same way as the output of mapper 2 unless B also goes through a mapper. In general, you cannot use a reducer without a mapper; the closest you can get is an IdentityMapper.
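To make the data flow of JOB 2 concrete, here is a small plain-Java simulation, again without Hadoop classes. It assumes that B records and the output of mapper 2 share keys, so that they end up in the same reduce group. The class, the helpers, and the upper-casing "mapper 2" are hypothetical stand-ins, not the real job.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Plain-Java simulation of two inputs, each with its own mapper, feeding one reducer.
class TwoInputJoinSketch {
    // Small helper to build a key/value record.
    static Map.Entry<String, String> entry(String key, String value) {
        return new SimpleEntry<String, String>(key, value);
    }

    // Runs one mapper over one input and groups its output by key in the shared shuffle map.
    static void mapInto(List<Map.Entry<String, String>> input,
                        Function<Map.Entry<String, String>, Map.Entry<String, String>> mapper,
                        Map<String, List<String>> shuffle) {
        for (Map.Entry<String, String> record : input) {
            Map.Entry<String, String> out = mapper.apply(record);
            shuffle.computeIfAbsent(out.getKey(), k -> new ArrayList<String>()).add(out.getValue());
        }
    }

    // Returns what reducer 2 would see: for each key, the values from both sources.
    static Map<String, List<String>> run(List<Map.Entry<String, String>> job2Input,
                                         List<Map.Entry<String, String>> bInput) {
        Map<String, List<String>> shuffle = new TreeMap<String, List<String>>();
        // Hypothetical mapper 2: upper-cases each value.
        mapInto(job2Input, r -> entry(r.getKey(), r.getValue().toUpperCase()), shuffle);
        // Identity mapper: forwards B records unchanged.
        mapInto(bInput, Function.identity(), shuffle);
        return shuffle;
    }

    public static void main(String[] args) {
        Map<String, List<String>> grouped = run(
                Arrays.asList(entry("k1", "x"), entry("k2", "y")),
                Arrays.asList(entry("k1", "side")));
        System.out.println(grouped);
    }
}
```

In this simulation the values for a key arrive in insertion order; in a real Hadoop job the order of values within a reduce group is not guaranteed, so reducer 2 would need some way (such as a tag on the value) to tell B records apart from mapper 2 output.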
If you want to process B in another way, you can read it through the Distributed Cache or, if it is just a number or two, set a configuration variable with this value, using conf.set("variableName", variableValue);. You can then read this value in the setup() method of reducer 2 (configure(JobConf) in the old mapred API used above), using conf.get("variableName", defaultValue);.
Upvotes: 1