Reputation: 81
I have two different types of files. The first type is a list of users with the following structure: UserID,Name,CountryID
The second type is a list of orders: OrderID,UserID,OrderSum
Each user has many orders. I need to write a Hadoop map-reduce job (in Java) that produces output with the following structure: CountryID,NumOfUsers,MinOrder,MaxOrder
It's not a problem for me to write two different mappers (one per file type) and one reducer in order to join the data from both files by UserID, producing records with the following structure: UserID,CountryID,UsersMinOrder,UsersMaxOrder
But I don't understand how to group that data by CountryID.
Upvotes: 0
Views: 433
Reputation: 3514
I'd recommend running this through Pig or Hive, as you can then solve this kind of thing in just a few lines.
Failing that, run a second MapReduce job on your joined data. In the mapper, for each input split keep track of the min order, max order, and number of users (rows, since each row carries a unique UserID) per CountryID. There are only a few countries, so you can hold these stats in memory for the whole map task. At the end of the split, emit the accumulated stats to the reducer, keyed by CountryID. The reducer then simply combines the partial aggregates from each split to find the global min, max, and count.
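To make the idea concrete, here is a minimal sketch of that second-pass aggregation in plain Java, with collections standing in for the Hadoop Mapper/Reducer plumbing. It assumes the joined records have the layout UserID,CountryID,UsersMinOrder,UsersMaxOrder described in the question; the class and method names (`CountryStats`, `mapSplit`, `reduce`) are hypothetical, not Hadoop API.

```java
import java.util.*;

public class CountryStats {
    // Per-country running stats, as the mapper would keep in memory.
    static class Stats {
        long users = 0;
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;

        void add(double userMin, double userMax) {
            users++;
            min = Math.min(min, userMin);
            max = Math.max(max, userMax);
        }

        // Reducer side: combine partial stats from another split.
        void merge(Stats other) {
            users += other.users;
            min = Math.min(min, other.min);
            max = Math.max(max, other.max);
        }
    }

    // "Map" phase over one input split: accumulate stats per CountryID.
    static Map<String, Stats> mapSplit(List<String> lines) {
        Map<String, Stats> perCountry = new HashMap<>();
        for (String line : lines) {
            String[] f = line.split(",");
            // f[0]=UserID, f[1]=CountryID, f[2]=UsersMinOrder, f[3]=UsersMaxOrder
            perCountry.computeIfAbsent(f[1], k -> new Stats())
                      .add(Double.parseDouble(f[2]), Double.parseDouble(f[3]));
        }
        return perCountry;
    }

    // "Reduce" phase: merge the partial stats emitted by each split.
    static Map<String, Stats> reduce(List<Map<String, Stats>> partials) {
        Map<String, Stats> global = new HashMap<>();
        for (Map<String, Stats> p : partials)
            for (Map.Entry<String, Stats> e : p.entrySet())
                global.computeIfAbsent(e.getKey(), k -> new Stats())
                      .merge(e.getValue());
        return global;
    }

    public static void main(String[] args) {
        Map<String, Stats> split1 = mapSplit(Arrays.asList(
            "1,DE,10.0,50.0", "2,DE,5.0,20.0"));
        Map<String, Stats> split2 = mapSplit(Arrays.asList(
            "3,FR,7.0,7.0"));
        Map<String, Stats> global = reduce(Arrays.asList(split1, split2));
        Stats de = global.get("DE");
        // CountryID,NumOfUsers,MinOrder,MaxOrder
        System.out.println("DE," + de.users + "," + de.min + "," + de.max);
        // prints DE,2,5.0,50.0
    }
}
```

In a real Hadoop job the `mapSplit` map would live as a field on the Mapper, filled in `map()` and flushed to the context in `cleanup()`, so each split emits only one record per country instead of one per user.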
Upvotes: 1