Reputation: 309
I need to chain multiple MapReduce streaming jobs in order to perform some computation over a large dataset.
I intend to use multiple reducers for each job in order to quicken the overall job. As a workflow scheduler I use Oozie.
Here is an illustration to clarify my problem: Let say I have two files
File 1: File 2:
A B 1 A B 3
A C 4 C D 6
B D 2 B D 1
I'd like to have two mappers and two reducers and get the following output for the MapReduce job:
Output:
A B 4
A C 4
B D 3
C D 6
But this is not at all what I get, instead I have partial sums.
Here is what I think happens.
Since I have multiple reducers for each MapReduce job, the input of the next job is split into several files. These files are given to the mappers which then send their output to the reducers. It seems that the mappers send their output to the reducers without waiting the whole input to be processed and sorted with name1, for example, as the key.
I've read several threads about using multiple files as an input and I don't think it is a matter of performing a map side join. Maybe it has to do with partitioning but I haven't exactly understood what partitioning consists in.
Is there any way to sort the output of several mappers before sending it to reducers ? Or can I tell Oozie to merge the output of several reducers in order to have only one file as the input of the next MapReduce Job ?
Upvotes: 1
Views: 453
Reputation: 99
I'm slightly new to MapReduce, but it looks like your job isn't processing the keys correctly, if you are not getting the desired output based on your example.
By default, Hadoop streaming uses Tab as the default field separator and takes everything from the start of a line to the first Tab character as the Key. In your case, if your input format is actually "A[space]B[space]1", you'll need to add
-D stream.map.output.field.separator= \
-D stream.num.map.output.key.fields=2 \
to your Hadoop streaming command in order to set space as the column delimiter and the first 2 columns as the key. This will map all the lines that start with "A B" to the same reducer. More info can be found here
Upvotes: 1