Reputation: 40909
I am new to Pig and would like to run two aggregation functions, but I am not sure how to do it. My data comprises one purchase transaction per row, where I have an SKU (inventory identifier) and the price that a customer paid for the SKU (where the price may vary):
sku price_paid
--- ----------
123 21.70
789 62.12
123 22.10
123 19.78
456 11.91
789 55.13
I would like to generate the following list, which contains the SKU, the number of times the SKU was purchased, and the average price paid for the SKU. The list should be sorted by the count in descending order.
sku count ave_price_paid
--- --------- --------------
123 3 21.19
789 2 58.63
456 1 11.91
Any help would be appreciated. I currently have not gotten far:
A = LOAD 'mydata.csv' USING PigStorage(',') AS (sku:chararray, price_paid:double);
B = GROUP A BY sku;
Upvotes: 0
Views: 600
Reputation: 771
This should do the trick for you:
C = FOREACH B GENERATE group, COUNT(A) AS count:long, AVG(A.price_paid) AS avg:double;
D = ORDER C BY count DESC;
Explanation:
Let's start with the code you already have:
A = LOAD 'mydata.csv' USING PigStorage(',') AS (sku:chararray, price_paid:double);
B = GROUP A BY sku;
Now we have the relation B, which consists of two fields: a group name and its corresponding bag. The group name is just the name of a group, like '123'. The bag is going to be a list of all of the rows for that particular group. For example:
sku bag
---- ----
123 {{123, 21.70}, {123, 22.10}, {123,19.78}}
If you enter DESCRIBE B;
then you should get the following:
{group: chararray, A: {sku:chararray, price_paid:double}}
In this schema, A is the bag, and out of the possibly many items in a bag, each has a sku field and price_paid field.
Now we need to use Pig's powerful FOREACH...GENERATE
statement:
C = FOREACH B GENERATE group, COUNT(A) AS count:long, AVG(A.price_paid) AS avg:double;
This line above is saying that for each row in the relation B, output:
COUNT()
of the group's bag - the number of rows in the entire bag. Using the AS
clause, name this column 'count' and make it a long.AVG()
of the all of the price_paid fields in the bag. Using the AS
clause, name this column 'avg' and make it a double.Lastly:
D = ORDER C BY count DESC;
This will sort relation C in descending order by the 'count' column.
Upvotes: 1
Reputation: 443
Aggregate functions take a bag of values and produce a single value. Since the bags that you will operate on are fields that are part of a relation that was created by the GROUP BY statement, Ill start off by explaining that.
GROUP BY will collect all the records with same value for a given key into a bag (a bag is an unordered collection of records). The records of the relation B contain 2 fields:
Let's try to use the DESCRIBE statement, which will show you the schema of a relation, e.g. if you do:
DESCRIBE B;
The output is:
B: {group: chararray,A: {(sku: chararray,price_paid: double)}}
which corresponds to the explanation above.
Having this in mind, now you can perform the following statement:
C = FOREACH B GENERATE group, COUNT(A) as (count:long), AVG(A.price_paid) as (avg:double);
COUNT counts the number of records in a bag, and AVG averages all values provided as an input, e.i. the price_paid values from the tuple records in the bag (notice the way you need to access them!)
Then you perform the ORDERING:
D = ORDER C BY count desc;
Here is the complete code:
A = LOAD 'pathOfYourFile' as (sku:chararray, price_paid:double);
B = GROUP A BY sku;
C = FOREACH B GENERATE group, COUNT(A) as (count:long), AVG(A.price_paid) as (avg:double);
D = ORDER C BY count desc;
For more information on the pig built-in functions, you can check out the apache reference: http://pig.apache.org/docs/r0.13.0/func.html
Upvotes: 3