MrFlom

Reputation: 99

Pig: Group By, Average, and Order By

I am new to Pig and have a text file in which each line is a separate record in the following format:

name, year, count, uniquecount

For example:

Zverkov winced_VERB 2004    8   8
Zverkov winced_VERB 2008    4   4
Zverkov winced_VERB 2009    1   1
zvlastni _ADV_  1913    1   1
zvlastni _ADV_  1928    2   2
zvlastni _ADV_  1929    3   2

I want to group the records by name, then for each name divide the total count by the total uniquecount, and finally sort the output by that calculated value.

Here is what I have been trying:

bigrams = LOAD 'input/bigram/zv.gz' AS (bigram:chararray, year:int, count:float, books:float);
group_bigrams = GROUP bigrams BY bigram;
average_bigrams = FOREACH group_bigrams GENERATE group, SUM(bigrams.count) / SUM(bigrams.books) AS average;
sorted_bigrams = ORDER average_bigrams BY average;

Upvotes: 1

Views: 1627

Answers (1)

MrFlom

Reputation: 99

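It turns out my original code does produce the desired output with one minor change to the ORDER statement: sort by average in descending order and break ties by group in ascending order.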

bigrams = LOAD 'input/bigram/zv.gz' AS (bigram:chararray, year:int, count:float, books:float);
group_bigrams = GROUP bigrams BY bigram;
average_bigrams = FOREACH group_bigrams GENERATE group, SUM(bigrams.count)/SUM(bigrams.books) AS average;
sorted_bigrams = ORDER average_bigrams BY average DESC, group ASC;
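
For anyone who wants to check the result, here is a minimal sketch of the same pipeline that prints the sorted relation. It assumes the input is tab-delimited (PigStorage's default, written out explicitly here), and the alias ratio_bigrams and the renaming of group to bigram are my own additions for readability, not part of the code above.

```pig
-- Sketch only: same logic as above, with the delimiter spelled out
-- and the grouping key renamed so later steps do not refer to "group".
bigrams = LOAD 'input/bigram/zv.gz' USING PigStorage('\t')
          AS (bigram:chararray, year:int, count:float, books:float);

-- One bag of records per distinct bigram.
group_bigrams = GROUP bigrams BY bigram;

-- Total count divided by total books for each bigram.
ratio_bigrams = FOREACH group_bigrams GENERATE
                    group AS bigram,
                    SUM(bigrams.count) / SUM(bigrams.books) AS average;

-- Highest ratio first; ties broken alphabetically by bigram.
sorted_bigrams = ORDER ratio_bigrams BY average DESC, bigram ASC;

-- Print the result to the console.
DUMP sorted_bigrams;
```

With the six sample lines from the question, the totals are 13/13 = 1.0 for "Zverkov winced_VERB" and 6/5 = 1.2 for "zvlastni _ADV_", so the zvlastni row should come first.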

Upvotes: 2
