user1624577
user1624577

Reputation: 577

all combinations using pig

I am at some very early stages of learning Pig/Pig Latin, so forgive the lack of knowledge. If we had a dataset that was in the format of something like:

fname, lname, month, pay, emp_category

Bob, Smith, January, 2000, non-manager

Bob, Smith, February, 2000, non-manager

John, Doe, January, 4500, manager

John, Doe, February, 4500, manager

I see how I would total each employee's pay up , put what if I wanted to create a dataset that also subtracts each employees total pay from each other employee's total pay:

Bob, Smith 4000, non-manager, John Doe, 9000, manager, 5000

With SQL I would probably just create two temp tables with:

SELECT fname, lname, sum(pay) as total_pay_m WHERE category = 'manager' INTO M_table FROM TABLE_NAME;

SELECT fname, lname, sum(pay) as total_pay_nm WHERE category = 'non_manager' INTO NM_table FROM TABLE_NAME;

SELECT *, ABS(total_pay_nm - total_pay_m) as PayDiff FROM M_table, NM_table WHERE M_table.fname <> NM_table.fname and M_table.lname <> NM_table.lname;

The SQL might not be perfect, but I hope the point is understood, but any assistance on achieving this with Pig would be appreciated.

Upvotes: 1

Views: 120

Answers (2)

LiMuBei
LiMuBei

Reputation: 3078

Fun exercise!

Input:

Bob,Smith,January,2000,non-manager
Bob,Smith,February,2000,non-manager
John,Doe,January,4500,manager
John,Doe,February,4500,manager
Susan,Smith,January,4800,manager
Susan,Smith,February,4800,manager

Pig:

A = load 'tmp.csv' using PigStorage(',') as (fname:chararray, lname:chararray, month:chararray, pay:int, emp_category:chararray);

split A into m if emp_category == 'manager', nm if emp_category == 'non-manager';

m_tmp = group m by (fname, lname);
m_pay = foreach m_tmp generate group.fname, group.lname, SUM(m.pay) as total_pay;

nm_tmp = group nm by (fname, lname);
nm_pay = foreach nm_tmp generate group.fname, group.lname, SUM(nm.pay) as total_pay;

C = cross m_pay, nm_pay;

R = foreach C generate *, nm_pay::total_pay - m_pay::total_pay;
dump R;

Output:

(John,Doe,9000,Bob,Smith,4000,-5000)
(Susan,Smith,9600,Bob,Smith,4000,-5600)

Upvotes: 1

user1624577
user1624577

Reputation: 577

I think using CROSS may give me what I need to get the combinations that I need, then subtract the two columns.

Upvotes: 0

Related Questions