Reputation: 577
I am at some very early stages of learning Pig/Pig Latin, so forgive the lack of knowledge. If we had a dataset that was in the format of something like:
fname, lname, month, pay, emp_category
Bob, Smith, January, 2000, non-manager
Bob, Smith, February, 2000, non-manager
John, Doe, January, 4500, manager
John, Doe, February, 4500, manager
I see how I would total each employee's pay up , put what if I wanted to create a dataset that also subtracts each employees total pay from each other employee's total pay:
Bob, Smith 4000, non-manager, John Doe, 9000, manager, 5000
With SQL I would probably just create two temp tables with:
SELECT fname, lname, sum(pay) as total_pay_m WHERE category = 'manager' INTO M_table FROM TABLE_NAME;
SELECT fname, lname, sum(pay) as total_pay_nm WHERE category = 'non_manager' INTO NM_table FROM TABLE_NAME;
SELECT *, ABS(total_pay_nm - total_pay_m) as PayDiff FROM M_table, NM_table WHERE M_table.fname <> NM_table.fname and M_table.lname <> NM_table.lname;
The SQL might not be perfect, but I hope the point is understood, but any assistance on achieving this with Pig would be appreciated.
Upvotes: 1
Views: 120
Reputation: 3078
Fun exercise!
Input:
Bob,Smith,January,2000,non-manager
Bob,Smith,February,2000,non-manager
John,Doe,January,4500,manager
John,Doe,February,4500,manager
Susan,Smith,January,4800,manager
Susan,Smith,February,4800,manager
Pig:
A = load 'tmp.csv' using PigStorage(',') as (fname:chararray, lname:chararray, month:chararray, pay:int, emp_category:chararray);
split A into m if emp_category == 'manager', nm if emp_category == 'non-manager';
m_tmp = group m by (fname, lname);
m_pay = foreach m_tmp generate group.fname, group.lname, SUM(m.pay) as total_pay;
nm_tmp = group nm by (fname, lname);
nm_pay = foreach nm_tmp generate group.fname, group.lname, SUM(nm.pay) as total_pay;
C = cross m_pay, nm_pay;
R = foreach C generate *, nm_pay::total_pay - m_pay::total_pay;
dump R;
Output:
(John,Doe,9000,Bob,Smith,4000,-5000)
(Susan,Smith,9600,Bob,Smith,4000,-5600)
Upvotes: 1
Reputation: 577
I think using CROSS may give me what I need to get the combinations that I need, then subtract the two columns.
Upvotes: 0