Reputation: 8690
I'm just getting started with Pig Latin and I have a question.
Right now I'm working with the classic example of word counting, where I process several e-books and then I get the list of words and the number of times each word appears.
Using that data as input for Pig, I sort it by the number of times each word appears and take the 5 most common words. So far so good. My problem is that I now want the 5 most common words such that each one appears a different number of times. Let me explain a bit better.
Imagine this output from the word-count job:
(hey, 1)
(hello, 10)
(my, 2)
(cat, 1)
(eat, 4)
(mom, 10)
(house, 10)
I then do the following on the Grunt shell:
data = load 'file' as (word:chararray, freq:int);
srtd = order data by freq desc;
lmtd = limit srtd 3;
dump lmtd;
The output I'd get is:
(hello, 10)
(mom, 10)
(house, 10)
But what if I wanted to get this:
(hello, 10)
(eat, 4)
(my, 2)
How would I filter out repeated freq values?
Thanks!
Upvotes: 0
Views: 1207
Reputation: 803
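You could write a UDF for this, which might be a little faster in terms of MapReduce jobs, but you could also try one of these. Both group the data by freq and keep a single word per distinct freq value.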
data = load 'file' as (word, freq);
counts = GROUP data BY freq;
countsLimited = FOREACH counts {
    -- TOP(n, column, bag): column is 0-based, so 1 refers to freq
    word = TOP(1, 1, data);
    GENERATE FLATTEN(word);
}
or
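data = load 'file' as (word, freq);
counts = GROUP data BY freq;
countsLimited = FOREACH counts {
    -- keep one arbitrary tuple from each freq group
    word = LIMIT data 1;
    -- FLATTEN turns the one-tuple bag into a plain (word, freq) row
    GENERATE FLATTEN(word);
}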
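Either variant only gives you one row per distinct freq. To get the three rows from the question you would still need to sort and limit afterwards. Here is a rough sketch of the whole pipeline. I'm assuming 'file' is tab-separated (the PigStorage default) and that freq can be loaded as an int. The alias names byFreq, onePerFreq and one are just ones I made up for this example.

```pig
-- assumption: 'file' is tab-separated with two columns, word and count
data = LOAD 'file' AS (word:chararray, freq:int);

-- one group per distinct freq value
byFreq = GROUP data BY freq;

-- keep a single (arbitrary) word from each group
onePerFreq = FOREACH byFreq {
    one = LIMIT data 1;
    GENERATE FLATTEN(one) AS (word, freq);
};

-- highest frequencies first, then keep the top 3
srtd = ORDER onePerFreq BY freq DESC;
lmtd = LIMIT srtd 3;
DUMP lmtd;
```

Note that which word survives for a tied freq (hello, mom or house for 10) is not guaranteed with LIMIT. If you need a deterministic pick, you could add an ORDER on word inside the nested block before the LIMIT.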
Upvotes: 1