feder
feder

Reputation: 1795

How to "sort" descending from the 30 most frequent words?

My mapper (Hadoop 1.2.1) creates key-values pairs of tokens, which I read from a simple text file. No rocket science. The reducer finally "bundles" (in Hadoop, do you call that grouping like in SQL?) the same keys and also sums the values of 1. This is the default Hadoop tutorial.

However, when these values are available to my reducer, I want to sort all of them descending. Only displaying the top 30 tokens (strings, words).

It seems like some concepts are not clear to me.

Upvotes: 1

Views: 3302

Answers (1)

Donald Miner
Donald Miner

Reputation: 39933

First, the reduce method is invoked for every key-value pair, right? Thus, I don't see a place to buffer something like a HashMap, which could hold the top results (most frequent tokens).

A bit of a nuance: the reduce method is ran once per key, not key-value pair. Each value with that key is presented in the Iterator. If you want to store a HashMap, you can set that up in the setup function (or make it a private object), interact with it in the reduce function, and then do whatever with it in the cleanup function. So it is definitely possible to maintain state across calls to reduce.

I think you might be able to solve your problem in a bit more clever way, however. I've written about top-ten lists a number of times, just because I find them interesting and they are very useful tools. I hope it's obvious how top-30 relates to top-10.

  • Here is an example of a top-ten list generator I wrote a while back that can be adapted to your problem. You may be able to change a bit how your are solving your problem to fit this pattern. In my code I use a TreeMap instead of a HashMap, because the TreeMap keeps the things in sorted order. Once you get to 31 items, pop off the one with the lowest frequency.

  • I also discuss the top-ten pattern in the book MapReduce Design patterns (sorry for the shameless plug).

  • I blogged about top ten lists a few months back.

Upvotes: 1

Related Questions