Reputation: 243
I have a mapreduce job that does some processing and produces a composite key (implements WritableComparable) of city:fruit with an associated count. Now I want to chain it with a secondary mapreduce job that determines the city with the highest count for each fruit type.
Sample composite key output from mapreduce job 1:
+---------------------+-------+
| city:fruit composite| count |
+---------------------+-------+
| london:apples | 3 |
+---------------------+-------+
| london:bannanas | 2 |
+---------------------+-------+
| london:oranges | 15 |
+---------------------+-------+
| charleston:apples | 20 |
+---------------------+-------+
| charleston:bannanas | 1 |
+---------------------+-------+
| charleston:oranges | 3 |
+---------------------+-------+
| chicago:bannanas | 17 |
+---------------------+-------+
| chicago:apples | 5 |
+---------------------+-------+
| chicago:oranges | 11 |
+---------------------+-------+
Desired output from job 2:
+------------+----------+
| city | fruit |
+------------+----------+
| london | oranges |
+------------+----------+
| charleston | apples |
+------------+----------+
| chicago | bannanas |
+------------+----------+
How can I accomplish this? In my SQL mind, the composite key would be two columns, one for city, one for fruit. I would group by the fruit, sort, and grab the row with the highest count. I can't figure out how that translates to the mapreduce world. Any advice would be appreciated!
Upvotes: 1
Views: 1118
Reputation: 4971
Process
Be aware that for each reducer a separate file is written. You can easily merge them afterwards with HDFS functionality. There is also the possibility to only have one reducer, however I didn't like this way because it is not scalable.
Upvotes: 2