stackoverflowuser2010
stackoverflowuser2010

Reputation: 40909

Pig: Running two aggregation functions

I am new to Pig and would like to run two aggregation functions, but I am not sure how to do it. My data comprises one purchase transaction per row, where I have an SKU (inventory identifier) and the price that a customer paid for the SKU (where the price may vary):

sku   price_paid
---   ----------
123        21.70
789        62.12
123        22.10
123        19.78
456        11.91
789        55.13

I would like to generate the following list, which contains the SKU, the number of times the SKU was purchased, and the average price paid for the SKU. The list should be sorted by the count in descending order.

sku      count  ave_price_paid
---  ---------  --------------
123          3           21.19
789          2           58.63 
456          1           11.91

Any help would be appreciated. I currently have not gotten far:

A = LOAD 'mydata.csv' USING PigStorage(',') AS (sku:chararray, price_paid:double);
B = GROUP A BY sku;

Upvotes: 0

Views: 600

Answers (2)

Andy Botelho
Andy Botelho

Reputation: 771

This should do the trick for you:

C = FOREACH B GENERATE group, COUNT(A) AS count:long, AVG(A.price_paid) AS avg:double;
D = ORDER C BY count DESC;

Explanation:

Let's start with the code you already have:

A = LOAD 'mydata.csv' USING PigStorage(',') AS (sku:chararray, price_paid:double);
B = GROUP A BY sku;

Now we have the relation B, which consists of two fields: a group name and its corresponding bag. The group name is just the name of a group, like '123'. The bag is going to be a list of all of the rows for that particular group. For example:

sku         bag
----       ----
123        {{123, 21.70}, {123, 22.10}, {123,19.78}}

If you enter DESCRIBE B; then you should get the following:

{group: chararray, A: {sku:chararray, price_paid:double}}

In this schema, A is the bag, and out of the possibly many items in a bag, each has a sku field and price_paid field.

Now we need to use Pig's powerful FOREACH...GENERATE statement:

C = FOREACH B GENERATE group, COUNT(A) AS count:long, AVG(A.price_paid) AS avg:double;

This line above is saying that for each row in the relation B, output:

  1. the name of the group
  2. the COUNT() of the group's bag - the number of rows in the entire bag. Using the AS clause, name this column 'count' and make it a long.
  3. the AVG() of the all of the price_paid fields in the bag. Using the AS clause, name this column 'avg' and make it a double.

Lastly:

D = ORDER C BY count DESC;

This will sort relation C in descending order by the 'count' column.

Upvotes: 1

Ranic
Ranic

Reputation: 443

Aggregate functions take a bag of values and produce a single value. Since the bags that you will operate on are fields that are part of a relation that was created by the GROUP BY statement, Ill start off by explaining that.

GROUP BY will collect all the records with same value for a given key into a bag (a bag is an unordered collection of records). The records of the relation B contain 2 fields:

  1. A key, which is named as group (e.i. sku)
  2. A bag of collected records, which name is given for the alias(relation name) that it was grouped (e.i. A). Notice below how does your bag "inherits" the same schema as A.

Let's try to use the DESCRIBE statement, which will show you the schema of a relation, e.g. if you do:

DESCRIBE B;

The output is:

B: {group: chararray,A: {(sku: chararray,price_paid: double)}}

which corresponds to the explanation above.

Having this in mind, now you can perform the following statement:

C = FOREACH B GENERATE group, COUNT(A) as (count:long), AVG(A.price_paid) as (avg:double);

COUNT counts the number of records in a bag, and AVG averages all values provided as an input, e.i. the price_paid values from the tuple records in the bag (notice the way you need to access them!)

Then you perform the ORDERING:

D = ORDER C BY count desc;

Here is the complete code:

A = LOAD 'pathOfYourFile' as (sku:chararray, price_paid:double);
B = GROUP A BY sku;
C = FOREACH B GENERATE group, COUNT(A) as (count:long), AVG(A.price_paid) as (avg:double);
D = ORDER C BY count desc;

For more information on the pig built-in functions, you can check out the apache reference: http://pig.apache.org/docs/r0.13.0/func.html

Upvotes: 3

Related Questions