Reputation: 423
I have a dataset such as:
UserID Item EventType
001 A Buy
001 B Sell
031 A Sell
008 C Buy
001 C Buy
001 A Buy
008 C Sell
How can I split the EventType column into a different column for each event. That is, I want two new columns EventType_Buy and EventType_Sell containing the counts of the occurrences of those events for each UserID and Item pair.
So the output should be something like this:
UserID Item EventType_Buy EventType_Sell
001 A 2 0
001 B 0 1
001 C 1 0
008 C 1 1
031 A 0 1
I'm not so much interested in the sorting, but I plan to use this data in R later, so I would like some help trying to perform this split into column counts.
I've tried creating separate objects for each event type and grouping by UserID, and Item, and generating the counts and trying to join these objects, but I'm not having much success.
Upvotes: 2
Views: 231
Reputation: 2287
Ref : https://pig.apache.org/docs/r0.14.0/basic.html#foreach
Pig Script :
input_data = LOAD 'input.csv' USING PigStorage(',') AS (user_id:chararray,item:chararray,event_type:chararray);
req_stats = FOREACH(GROUP input_data BY (user_id,item)) {
buy_bag = FILTER input_data BY event_type == 'Buy';
sell_bag = FILTER input_data BY event_type == 'Sell';
GENERATE FLATTEN(group) AS (user_id,item), COUNT(buy_bag) AS event_type_buy, COUNT(sell_bag) AS event_type_sell;
};
DUMP req_stats;
Input :
001,A,Buy
001,B,Sell
031,A,Sell
008,C,Buy
001,C,Buy
001,A,Buy
008,C,Sell
Output : DUMP req_stats
(001,A,2,0)
(001,B,0,1)
(001,C,1,0)
(008,C,1,1)
(031,A,0,1)
Upvotes: 2