Reputation: 282
Problem Statement Assume there is one text file of logs. Below are the fields in the file.
Log File
userID
productID
action
Where Action would be one of these –
Browse, Click, AddToCart, Purchase, LogOut
Select users who performed AddToCart action but did not perform Purchase action.
('1001','101','201','Browse'),
('1002','102','202','Click'),
('1001','101','201','AddToCart'),
('1001','101','201','Purchase'),
('1002','102','202','AddToCart')
Can anyone suggest to get this info using hive or pig with optimised performance
Upvotes: 0
Views: 167
Reputation: 38335
This is possible to do using sum() or analytical sum() depending on exact requirements in a single table scan. What if User added to cart two products, but purchased only one?
For User+Product:
select userID, productID
from
(
select
userID,
productID,
sum(case when action='AddToCart' then 1 else 0 end) addToCart_cnt,
sum(case when action='Purchase' then 1 else 0 end) Purchase_cnt
from table
group by userID, productID
)s
where addToCart_cnt>0 and Purchase_cnt=0
Upvotes: 1
Reputation: 11090
Hive: Use not in
select * from table
where action='AddtoCart' and
userID not in (select distinct userID from table where action='Purchase')
Pig: Filter the ids using action and do a left join and check id is null
A = LOAD '\path\file.txt' USING PigStorage(',') AS (userID:int,b:int,c:int,action:chararray) -- Note I am assuming the first 3 columns are int.You will have to figure out the loading without the quotes.
B = FILTER A BY (action='AddToCart');
C = FILTER A BY (action='Purchase');
D = JOIN B BY userID LEFT OUTER,C BY userID;
E = FILTER D BY C.userID is null;
DUMP E;
Upvotes: 0