thomers

Reputation: 2693

MapReduce (secondary) sorting / filtering - how?

I have a logfile of timestamped values (concurrent users) for different "zones" of a chatroom webapp, in the format "Timestamp; Zone; Value". Each zone has one value per minute of each day.

For each zone, I want to list the maximum value per day, ordered descending by that maximum value.

So, an input file of

#timestamp; zone; value
2011-01-01 00:00:00; 1; 10
2011-01-01 00:00:00; 2; 22
2011-01-01 00:01:00; 1; 11
2011-01-01 00:01:00; 2; 21

2011-01-02 00:00:00; 1; 12
2011-01-02 00:00:00; 2; 20

should produce for zone 1:

2011-01-02    12
2011-01-01    11

and for zone 2:

2011-01-01    22
2011-01-02    20

How would I approach this? IMHO I will need more than one M/R step.

What I have implemented so far is:

This results in a file like

2011-01-01/1    11
2011-01-01/2    22
2011-01-02/1    12
2011-01-02/2    20
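A first step producing this kind of intermediate file can be checked outside Hadoop. The following is a minimal local Python simulation of that map/reduce pass (function names and the in-memory "log" are mine, not from the question): the map emits `("yyyy-mm-dd/zone", value)` per record, and the reduce keeps the maximum per key.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit ("yyyy-mm-dd/zone", value) for each input record."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and the header comment
        timestamp, zone, value = [field.strip() for field in line.split(";")]
        day = timestamp.split(" ")[0]
        yield f"{day}/{zone}", int(value)

def reduce_phase(pairs):
    """Reduce: keep the maximum value seen for each day/zone key."""
    maxima = defaultdict(lambda: float("-inf"))
    for key, value in pairs:
        maxima[key] = max(maxima[key], value)
    return dict(maxima)

log = [
    "#timestamp; zone; value",
    "2011-01-01 00:00:00; 1; 10",
    "2011-01-01 00:00:00; 2; 22",
    "2011-01-01 00:01:00; 1; 11",
    "2011-01-01 00:01:00; 2; 21",
    "2011-01-02 00:00:00; 1; 12",
    "2011-01-02 00:00:00; 2; 20",
]
print(reduce_phase(map_phase(log)))
# {'2011-01-01/1': 11, '2011-01-01/2': 22, '2011-01-02/1': 12, '2011-01-02/2': 20}
```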

Would this be the input for a second M/R step? If so, what would I take as key and value?

I have studied the "Secondary Sort" example in "Hadoop - The Definitive Guide", but I'm not sure whether and how to apply this here.

Is it possible to M/R into several output-files (one per zone)?

UPDATE After thinking about it, I will try this:

Upvotes: 1

Views: 4384

Answers (3)

Pranab

Reputation: 663

You can do this with just one MR using secondary sorting. Here are the steps:

  1. Define the key as the concatenation of zone, date, and value, i.e. zone:yyyy-mm-dd:value. As I will explain, you don't even need to emit any value from the mapper; NullWritable is good enough for the value.

  2. Implement a key comparator such that the zone:yyyy-mm-dd part of the key is ordered ascending and the value part is ordered descending. This ensures that, among all keys with a given zone:yyyy-mm-dd, the first key in the group has the highest value.

  3. Define the partitioner and grouping comparator of the composite key based only on the zone and day parts of the key, i.e. zone:yyyy-mm-dd.

  4. In your reducer input you will get the first key of each key group, which contains the zone, the day, and the maximum value for that zone/day combination. The value part of the reducer input will be a list of NullWritable, which can be ignored.
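The four steps above can be simulated outside Hadoop to verify the logic. In this local Python sketch (names and sample data are mine), the sort key plays the role of the key comparator from step 2, and the `groupby` key plays the role of the grouping comparator from step 3, so the first element of each group is the maximum, as step 4 claims:

```python
from itertools import groupby

# Sample records as (zone, day, value) composite keys, per step 1.
records = [
    ("1", "2011-01-01", 10), ("2", "2011-01-01", 22),
    ("1", "2011-01-01", 11), ("2", "2011-01-01", 21),
    ("1", "2011-01-02", 12), ("2", "2011-01-02", 20),
]

# Step 2 (key comparator): zone and day ascending, value descending.
keys = sorted(records, key=lambda k: (k[0], k[1], -k[2]))

# Steps 3+4 (grouping comparator): group on zone/day only; the first
# key in each group carries the maximum value for that zone and day.
results = []
for (zone, day), group in groupby(keys, key=lambda k: (k[0], k[1])):
    _, _, max_value = next(group)
    results.append((zone, day, max_value))

print(results)
```

Note that this yields the per-zone/per-day maxima; ordering each zone's output descending by that maximum (as the question asks) would still need a final sort on the value part.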

Upvotes: 7

yura

Reputation: 14655

Secondary sort in MapReduce is solved with the composite-key pattern: you create a key like (ZoneId, Timestamp), and in the reducer you iterate first over zones and then over timestamps, so you can easily compute the per-day maximum.
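This variant of the composite-key pattern can also be sketched as a local Python simulation (variable names and sample data are mine): keys are (zone, timestamp) pairs, records are grouped by zone, and the reducer derives the per-day maximum while walking the timestamps in order.

```python
from itertools import groupby

# (zone, timestamp) composite keys with the observed value.
pairs = [
    (("1", "2011-01-01 00:00:00"), 10),
    (("2", "2011-01-01 00:00:00"), 22),
    (("1", "2011-01-01 00:01:00"), 11),
    (("2", "2011-01-01 00:01:00"), 21),
    (("1", "2011-01-02 00:00:00"), 12),
    (("2", "2011-01-02 00:00:00"), 20),
]

# Sort by zone first, then timestamp (what the shuffle would do).
pairs.sort(key=lambda kv: kv[0])

# "Reducer": one group per zone, iterating timestamps in order and
# tracking the maximum value per day.
per_zone_max = {}
for zone, group in groupby(pairs, key=lambda kv: kv[0][0]):
    daily = {}
    for (_, timestamp), value in group:
        day = timestamp.split(" ")[0]
        daily[day] = max(daily.get(day, value), value)
    per_zone_max[zone] = daily

print(per_zone_max)
# {'1': {'2011-01-01': 11, '2011-01-02': 12}, '2': {'2011-01-01': 22, '2011-01-02': 20}}
```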

Upvotes: 0

Joseph Ottinger

Reputation: 4951

I don't know that you'd need two map/reduce steps - you could certainly do it with one; it's just that your results would be lists instead of single entries. Otherwise, yes, you'd split it up by zone, then split it by date.

I'd probably split it up by zone, then have each zone return a list of the highest elements by day, since the reduction would be really easy at that point. To really get a benefit out of another map/reduction step you'd have to have a really large dataset and a lot of machines to split across - at which point I'd probably do a reduction on the entire key.

Upvotes: 0
