Java Mapreduce group by compositekey and sort

Question

I have a mapreduce job that does some processing and produces a composite key (implements WritableComparable) of city:fruit with an associated count. Now I want to chain it with a secondary mapreduce job that determines the city with the highest count for each fruit type.

Sample composite key output from mapreduce job 1:

+---------------------+-------+
| city:fruit composite| count |
+---------------------+-------+
| london:apples       | 3     |
+---------------------+-------+
| london:bannanas     | 2     |
+---------------------+-------+
| london:oranges      | 15    |
+---------------------+-------+
| charleston:apples   | 20    |
+---------------------+-------+
| charleston:bannanas | 1     |
+---------------------+-------+
| charleston:oranges  | 3     |
+---------------------+-------+
| chicago:bannanas    | 17    |
+---------------------+-------+
| chicago:apples      | 5     |
+---------------------+-------+
| chicago:oranges     | 11    |
+---------------------+-------+

Desired output from job 2:

+------------+----------+
| city       | fruit    |
+------------+----------+
| london     | oranges  |
+------------+----------+
| charleston | apples   |
+------------+----------+
| chicago    | bannanas |
+------------+----------+

How can I accomplish this? In my SQL mind, the composite key would be two columns, one for city, one for fruit. I would group by the fruit, sort, and grab the row with the highest count. I can't figure out how that translates to the mapreduce world. Any advice would be appreciated!

Matthias Kricke · Accepted Answer

Process

Read your data into a new map reduce job
Split your information into city as key and a compound value of fruit:count
In the reduce phase you have every value for a city at hand. Now you can iterate over all of those values in a loop. Split them up and remember the biggest fruit count and fruit.
Now write your data to a database or the HDFS

Be aware that for each reducer a separate file is written. You can easily merge them afterwards with HDFS functionality. There is also the possibility to only have one reducer, however I didn't like this way because it is not scalable.

Java Mapreduce group by compositekey and sort

Answers (1)

Related Questions