wsb3383

Reputation: 3881

How do you implement ranking and sorting in map/reduce?

I'm learning the Java map/reduce API in Hadoop and trying to wrap my head around thinking in map/reduce. Here's a sample program I'm writing against Apache HTTP server log files. It has two phases, each implemented as an M/R job and then chained together:

  1. Count the number of times each IP address accessed the server
  2. Find the top 5 IP addresses (most requests)

Phase 1 seems pretty trivial; it's a simple counting implementation in map/reduce, and it emits something like the following:

    192.168.0.2  4
    10.0.0.2  7
    127.0.0.1  3
    ...etc
    

This output would feed into the mapper of the second map/reduce job.

Now I'm confused about how to implement the top 5 in a parallel way. Since reducers are sequential in nature, I'm guessing there would only be one reducer that goes over the full list to sort it, right? How do you go about solving step #2 in a parallel way?

Upvotes: 2

Views: 2210

Answers (1)

Donald Miner

Reputation: 39883

First of all, if the output of the first job is small enough that you don't need to parallelize it, consider:

hadoop fs -cat joboutput/part-* | sort -k2 -rn | head -n5

In many cases this will be faster than funneling everything through a single reducer!


Sorting in Hadoop is pretty rough when you try to get away from using only 1 reducer. If you are interested in sorting, try checking out TotalOrderPartitioner. By searching the web for it you should find some examples. The fundamental idea is that you partition your values into ascending-value bins with a custom partitioner. Then each bin is sorted naturally. You output, and you have a totally sorted set.
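To make the binning idea concrete, here is a plain-Java sketch of what such a partitioner decides (the cut points here are hand-picked for illustration; TotalOrderPartitioner actually derives them by sampling the input, and a real implementation would extend Hadoop's Partitioner class):

```java
public class BinPartitionSketch {
    // Hypothetical cut points: counts below 10 go to reducer 0, below 100 to
    // reducer 1, and so on. With 3 cut points we get 4 ascending bins.
    static final int[] CUT_POINTS = {10, 100, 1000};

    // Return which reducer (bin) a given request count belongs to.
    static int partition(int count) {
        for (int i = 0; i < CUT_POINTS.length; i++) {
            if (count < CUT_POINTS[i]) {
                return i;
            }
        }
        return CUT_POINTS.length; // largest bin
    }

    public static void main(String[] args) {
        int[] counts = {4, 57, 312, 4000};
        for (int c : counts) {
            System.out.println(c + " -> reducer " + partition(c));
        }
    }
}
```

Because bin 0 holds only the smallest counts and bin 3 only the largest, concatenating each reducer's (locally sorted) output in bin order yields a globally sorted result. The hard part, as noted below, is choosing the cut points so the bins are evenly loaded.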

The hard part is figuring out how to put data into which bins.


If you are interested in the top 5 specifically (or top 50, whatever), there is an interesting way to do that. The basic premise is that each mapper finds its local top 5, and then the reducer takes the top 5 of all the mappers' top 5's. Each mapper effectively sends its top five to the reducer to compete for the true top five, kind of like a tournament. The reducer is guaranteed to see the true top 5; it just needs to weed the rest out.

To keep track of the top-5 in both the mapper and reducer, I like to use a TreeMap. Basically, keep inserting values, and keep truncating it to the top 5. In the Mapper#cleanup method, write out the top 5 records (don't write out during the map itself). Do the same for the reducer.


I'll plug Apache Pig here for something like this. It might not be as efficient as the options above, but it sure is easier to code.

loaded = LOAD 'joboutput/' USING PigStorage('\t') AS (ip:chararray, cnt:int);
sorted = ORDER loaded BY cnt DESC;
top = LIMIT sorted 5;
dump top;

Sorry that something as simple as sorting is not as straightforward as you might have imagined in Hadoop. Some things are going to be easy (e.g., the ip counting that you did) and others are going to be hard (sorting, joins). Just the nature of the beast.

Upvotes: 2
