Green
Green

Reputation: 423

Create columns based on count of each unique value of one column in Pig

I have a dataset such as:

UserID Item EventType
  001   A     Buy
  001   B     Sell
  031   A     Sell
  008   C     Buy
  001   C     Buy
  001   A     Buy
  008   C     Sell

How can I split the EventType column into a different column for each event. That is, I want two new columns EventType_Buy and EventType_Sell containing the counts of the occurrences of those events for each UserID and Item pair.

So the output should be something like this:

UserID Item EventType_Buy EventType_Sell
  001   A         2              0
  001   B         0              1  
  001   C         1              0  
  008   C         1              1  
  031   A         0              1          

I'm not so much interested in the sorting, but I plan to use this data in R later, so I would like some help trying to perform this split into column counts.

I've tried creating separate objects for each event type and grouping by UserID, and Item, and generating the counts and trying to join these objects, but I'm not having much success.

Upvotes: 2

Views: 231

Answers (1)

Murali Rao
Murali Rao

Reputation: 2287

Ref : https://pig.apache.org/docs/r0.14.0/basic.html#foreach

Pig Script :

input_data = LOAD 'input.csv' USING PigStorage(',') AS (user_id:chararray,item:chararray,event_type:chararray);

req_stats = FOREACH(GROUP input_data BY (user_id,item)) {
                buy_bag = FILTER input_data BY event_type == 'Buy';
                sell_bag = FILTER input_data BY event_type == 'Sell';
                GENERATE FLATTEN(group) AS (user_id,item), COUNT(buy_bag) AS event_type_buy, COUNT(sell_bag) AS event_type_sell; 
};

DUMP req_stats;

Input :

001,A,Buy
001,B,Sell
031,A,Sell
008,C,Buy
001,C,Buy
001,A,Buy
008,C,Sell

Output : DUMP req_stats

(001,A,2,0)
(001,B,0,1)
(001,C,1,0)
(008,C,1,1)
(031,A,0,1)

Upvotes: 2

Related Questions