ph34r
ph34r

Reputation: 243

Java Mapreduce group by compositekey and sort

I have a mapreduce job that does some processing and produces a composite key (implements WritableComparable) of city:fruit with an associated count. Now I want to chain it with a secondary mapreduce job that determines the city with the highest count for each fruit type.

Sample composite key output from mapreduce job 1:

+---------------------+-------+
| city:fruit composite| count |
+---------------------+-------+
| london:apples       | 3     |
+---------------------+-------+
| london:bannanas     | 2     |
+---------------------+-------+
| london:oranges      | 15    |
+---------------------+-------+
| charleston:apples   | 20    |
+---------------------+-------+
| charleston:bannanas | 1     |
+---------------------+-------+
| charleston:oranges  | 3     |
+---------------------+-------+
| chicago:bannanas    | 17    |
+---------------------+-------+
| chicago:apples      | 5     |
+---------------------+-------+
| chicago:oranges     | 11    |
+---------------------+-------+

Desired output from job 2:

+------------+----------+
| city       | fruit    |
+------------+----------+
| london     | oranges  |
+------------+----------+
| charleston | apples   |
+------------+----------+
| chicago    | bannanas |
+------------+----------+

How can I accomplish this? In my SQL mind, the composite key would be two columns, one for city, one for fruit. I would group by the fruit, sort, and grab the row with the highest count. I can't figure out how that translates to the mapreduce world. Any advice would be appreciated!

Upvotes: 1

Views: 1118

Answers (1)

Matthias Kricke
Matthias Kricke

Reputation: 4971

Process

  1. Read your data into a new map reduce job
  2. Split your information into city as key and a compound value of fruit:count
  3. In the reduce phase you have every value for a city at hand. Now you can iterate over all of those values in a loop. Split them up and remember the biggest fruit count and fruit.
  4. Now write your data to a database or the HDFS

Be aware that for each reducer a separate file is written. You can easily merge them afterwards with HDFS functionality. There is also the possibility to only have one reducer, however I didn't like this way because it is not scalable.

Upvotes: 2

Related Questions