Prabhanj
Prabhanj

Reputation: 282

Suggest most optimized way using hive or pig

Problem Statement Assume there is one text file of logs. Below are the fields in the file.

Log File

userID
productID
action

Where Action would be one of these –

Browse, Click, AddToCart, Purchase, LogOut

Select users who performed AddToCart action but did not perform Purchase action.

('1001','101','201','Browse'),
('1002','102','202','Click'),
('1001','101','201','AddToCart'),
('1001','101','201','Purchase'),
('1002','102','202','AddToCart')

Can anyone suggest to get this info using hive or pig with optimised performance

Upvotes: 0

Views: 167

Answers (2)

leftjoin
leftjoin

Reputation: 38335

This is possible to do using sum() or analytical sum() depending on exact requirements in a single table scan. What if User added to cart two products, but purchased only one?

For User+Product:

select userID, productID 
  from
(
select 
       userID,
       productID,
       sum(case when action='AddToCart' then 1 else 0 end) addToCart_cnt,
       sum(case when action='Purchase' then 1 else 0 end)  Purchase_cnt
  from table
  group by userID, productID
)s
where addToCart_cnt>0 and Purchase_cnt=0

Upvotes: 1

nobody
nobody

Reputation: 11090

Hive: Use not in

select * from table 
where action='AddtoCart' and 
      userID not in (select distinct userID from table where action='Purchase')

Pig: Filter the ids using action and do a left join and check id is null

A = LOAD '\path\file.txt' USING PigStorage(',') AS (userID:int,b:int,c:int,action:chararray) -- Note I am assuming the first 3 columns are int.You will have to figure out the loading without the quotes.
B = FILTER A BY (action='AddToCart');
C = FILTER A BY (action='Purchase');
D = JOIN B BY userID LEFT OUTER,C BY userID;
E = FILTER D BY C.userID is null;
DUMP E;

Upvotes: 0

Related Questions