Reputation: 1795
My mapper (Hadoop 1.2.1) creates key-value pairs of tokens, which I read from a simple text file. No rocket science. The reducer finally "bundles" (in Hadoop, do you call that grouping, like in SQL?) the same keys and also sums their values of 1. This is the default Hadoop tutorial.
However, when these values are available to my reducer, I want to sort all of them in descending order and display only the top 30 tokens (strings, words).
It seems like some concepts are not clear to me.
The reduce method is invoked for every key-value pair, right? Thus, I don't see a place to buffer something like a HashMap, which could hold the top results (most frequent tokens). I was thinking that if I had such a variable, I could easily compare and insert every key whose value is within the top 30. What is the appropriate approach to handle this frequency-ranking task?
public static class Reduce extends MapReduceBase implements
        Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        // CURRENTLY I SIMPLY OUTPUT THE KEY AND THE SUM.
        // IN THIS PLACE, HOW COULD YOU STORE E.G. A HASHMAP THAT
        // COULD STORE THE TOP 30?
        output.collect(key, new IntWritable(sum));
        LOG.info("REDUCE: added to output:: key: " + key.toString());
    }
}
Upvotes: 1
Views: 3302
Reputation: 39933
First, the reduce method is invoked for every key-value pair, right? Thus, I don't see a place to buffer something like a HashMap, which could hold the top results (most frequent tokens).
A bit of a nuance: the reduce method is run once per key, not once per key-value pair. All the values for that key are presented through the Iterator. If you want to store a HashMap, you can set it up in the setup method (or make it a private field), interact with it in the reduce method, and then do whatever you like with it in the cleanup method. So it is definitely possible to maintain state across calls to reduce.
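To make the pattern concrete, here is a minimal sketch without the Hadoop dependencies: a private map serves as the reducer's state, reduce() buffers each key's total instead of writing it out, and a cleanup-style method sorts the buffer and emits the top entries at the end. The class and method names are illustrative, not real Hadoop API. (Note that in the old org.apache.hadoop.mapred API your code uses, the analogous hooks are configure() and close() rather than setup() and cleanup().)

```java
import java.util.*;

// Sketch (not real Hadoop code): the reducer keeps a private map as state,
// updates it on each reduce() call, and emits results once at the end.
class TopTokensReducer {
    private final Map<String, Integer> totals = new HashMap<>();

    // Called once per key, with all of that key's values.
    void reduce(String key, Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        totals.put(key, sum); // buffer instead of collecting immediately
    }

    // Called once at the end: sort the buffered totals, keep the top N.
    List<String> cleanup(int topN) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(totals.entrySet());
        entries.sort((a, b) -> b.getValue() - a.getValue()); // descending by count
        List<String> top = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries.subList(0, Math.min(topN, entries.size()))) {
            top.add(e.getKey() + "\t" + e.getValue());
        }
        return top;
    }
}
```

In real Hadoop code the map would be a field of the Reducer class, reduce() would fill it, and the end-of-task hook would do the sorting and the output.collect() calls.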
I think you might be able to solve your problem in a bit more clever way, however. I've written about top-ten lists a number of times, just because I find them interesting and they are very useful tools. I hope it's obvious how top-30 relates to top-10.
Here is an example of a top-ten list generator I wrote a while back that can be adapted to your problem.
You may be able to change how you are solving your problem a bit to fit this pattern. In my code I use a TreeMap instead of a HashMap, because the TreeMap keeps its entries in sorted order. Once you get to 31 items, pop off the one with the lowest frequency.
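The TreeMap trick can be sketched like this (plain Java, names hypothetical): key the TreeMap by frequency so it stays sorted, and whenever it grows past N entries, remove the first (lowest) key. One caveat of this simple version is that two tokens with the same frequency overwrite each other.

```java
import java.util.*;

class TopNExample {
    // Returns the n most frequent tokens, highest frequency first.
    static List<String> topN(Map<String, Integer> counts, int n) {
        // TreeMap keeps entries sorted by key (the frequency here).
        // Caveat: equal frequencies collide in this minimal sketch.
        TreeMap<Integer, String> top = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            top.put(e.getValue(), e.getKey());
            if (top.size() > n) {
                top.remove(top.firstKey()); // pop off the lowest frequency
            }
        }
        return new ArrayList<>(top.descendingMap().values());
    }
}
```

For top 30 you would call this with n = 30; the TreeMap never holds more than 31 entries, so memory stays bounded no matter how many distinct tokens there are.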
I also discuss the top-ten pattern in the book MapReduce Design patterns (sorry for the shameless plug).
I blogged about top ten lists a few months back.
Upvotes: 1