Deyang

Reputation: 510

Counting multiple keys in 1 MapReduce

I currently have a list of trades with the following columns:

TradeID, SellerID, FishType, Price, Qty

I am looking to get

  1. The count of each SellerID
  2. The Qty for each FishType

Currently, I have written a mapper that outputs <FishType, Qty> so that the reducer can sum the Qty for each FishType. However, to get the count of each SellerID, do I have to write a separate MapReduce job? Or is there a way to do both within the same job?

I have considered using Counters. However, the SellerIDs in the records are unknown to me at the time of coding, and there are probably too many to keep track of with counters. In my opinion it would also be an abuse of the Counter feature.

Please advise.

Upvotes: 2

Views: 1576

Answers (1)

Donald Miner

Reputation: 39893

The obvious way to do this is to have a separate mapreduce job.

The trickier way to do this is to "overload" your keys. I'm guessing SellerID and FishType are both strings. In the mapper, when the key is a SellerID, add "S:" to the front of the string, and when it is a FishType, add "F:" to the front of the string.

Then, when you reach the reducer, each call to the reduce function will receive one of two kinds of key: one that starts with "S:", in which case it is a SellerID, or one that starts with "F:", in which case it is a FishType. You have separate logic in the reducer based on that prefix.

Finally, you use MultipleOutputs (be careful with the .mapred. vs. .mapreduce. versions; they aren't compatible) to write out the results to two different directories: one for FishType and one for SellerID.
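
To make the prefixed-key flow concrete, here is a rough sketch in plain Java that simulates the map, shuffle and reduce steps in memory. It does not use any Hadoop classes. Every name in it (PrefixedKeyDemo, Result, run and so on) is made up for illustration, the two maps in Result merely stand in for the two output directories that MultipleOutputs would write to, and it assumes Qty is a whole number and that the columns arrive in the order given in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the prefixed-key idea. No Hadoop classes are used;
// all names here are hypothetical and exist only for this sketch.
class PrefixedKeyDemo {

    // Stand-in for the two output directories MultipleOutputs would write to.
    static class Result {
        final Map<String, Long> sellerCounts = new TreeMap<>();
        final Map<String, Long> fishQty = new TreeMap<>();
    }

    // "Mapper": emit two prefixed keys per trade.
    // Assumed column order: TradeID, SellerID, FishType, Price, Qty (Qty integral).
    static void map(String[] trade, Map<String, List<Long>> shuffle) {
        emit(shuffle, "S:" + trade[1], 1L);                        // one occurrence of this seller
        emit(shuffle, "F:" + trade[2], Long.parseLong(trade[4]));  // quantity of this fish type
    }

    // Groups values by key, which is what the shuffle phase does between map and reduce.
    static void emit(Map<String, List<Long>> shuffle, String key, long value) {
        shuffle.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // "Reducer": sum the values for one key, then route the result by prefix.
    static void reduce(String key, List<Long> values, Result out) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        String name = key.substring(2); // strip the "S:" or "F:" prefix
        if (key.startsWith("S:")) {
            out.sellerCounts.put(name, sum);
        } else if (key.startsWith("F:")) {
            out.fishQty.put(name, sum);
        }
    }

    // Runs map over every trade, then reduce over every grouped key.
    static Result run(List<String[]> trades) {
        Map<String, List<Long>> shuffle = new TreeMap<>();
        for (String[] trade : trades) {
            map(trade, shuffle);
        }
        Result out = new Result();
        for (Map.Entry<String, List<Long>> e : shuffle.entrySet()) {
            reduce(e.getKey(), e.getValue(), out);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> trades = Arrays.asList(
            new String[] {"1", "alice", "cod", "10.0", "5"},
            new String[] {"2", "bob", "cod", "11.0", "3"},
            new String[] {"3", "alice", "tuna", "20.0", "2"});
        Result r = run(trades);
        System.out.println(r.sellerCounts); // {alice=2, bob=1}
        System.out.println(r.fishQty);      // {cod=8, tuna=2}
    }
}
```

In this sketch both branches happen to compute a sum (of 1s for sellers, of Qty for fish), so only the routing differs. In a real job the branch on the prefix is also where you would pick which named output to write to.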


You might want to use Pig or Hive for this instead.

Upvotes: 4
