Filter records by key - PigLatin

Question

I'm starting to get into PigLatin and I have a question...

Right now I'm working with the classic example of word counting, where I process several e-books and then I get the list of words and the number of times each word appears.

Using that data as input for Pig, I then sort it by the number of times each word appears and take the 5 most common words. So far so good. My problem is that I now want the 5 most common words such that each one appears a different number of times. Let me explain a bit better:

Imagine this output to the word-count job:

(hey, 1)
(hello, 10)
(my, 2)
(cat, 1)
(eat, 4)
(mom, 10)
(house, 10)

I then do the following on the Grunt shell:

-- assumes tab-delimited input; freq is typed as int so it sorts numerically
data = load 'file' as (word:chararray, freq:int);
srtd = order data by freq desc;
lmtd = limit srtd 3;
dump lmtd;

The output I'd get is:

(hello, 10)
(mom, 10)
(house, 10)

But what if I wanted to get this:

(hello, 10)
(eat, 4)
(my, 2)

How would I filter out repeated freq values?

Thanks!

NerdyNick · Accepted Answer

You could write a UDF, which might be a little faster in terms of MapReduce work, but you could try one of these. Both group the records by freq and keep a single word from each group.

data = load 'file' as (word:chararray, freq:int);
counts = GROUP data BY freq;
countsLimited = FOREACH counts {
    -- TOP(n, column, bag): column indexes are 0-based, so 1 is freq
    word = TOP(1, 1, data);
    GENERATE FLATTEN(word);
}

or

data = load 'file' as (word:chararray, freq:int);
counts = GROUP data BY freq;
countsLimited = FOREACH counts {
    word = LIMIT data 1;
    -- FLATTEN unnests the one-tuple bag into a plain (word, freq) record
    GENERATE FLATTEN(word);
}
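
Either snippet leaves one record per distinct freq, but the result is not yet sorted or trimmed to the top few. Here is a minimal sketch of the whole pipeline, assuming a tab-delimited 'file' and typed fields. The aliases by_freq, picked and one_per_freq are just names chosen for this illustration:

```pig
-- Assumption: 'file' is tab-delimited, one word and its count per line.
data = LOAD 'file' AS (word:chararray, freq:int);

-- One group per distinct frequency.
by_freq = GROUP data BY freq;

-- Keep a single (arbitrary) word from each group.
one_per_freq = FOREACH by_freq {
    picked = LIMIT data 1;
    GENERATE FLATTEN(picked) AS (word, freq);
}

-- Highest frequencies first, then keep the top 3.
srtd = ORDER one_per_freq BY freq DESC;
lmtd = LIMIT srtd 3;
DUMP lmtd;
```

With the sample data from the question this should give one of the words with freq 10, followed by (eat,4) and (my,2). Which of the 10-count words survives is not guaranteed, since the nested LIMIT picks an arbitrary tuple from the group.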
