user2726995
user2726995

Reputation: 2062

Pig using multiple over partition by

How can I translate the following SQL statement into pig latin?

select userid, siteid, eventdate, 
 count(*) over(partition by userid, siteid order by eventdate) as c, 
 rank() over (partition by userid, siteid order by eventdate) as rank
from views 

I see that pig supports windows functions here: http://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/evaluation/Over.html

and I can work out the first partitioning projection (I think) as

A = load views as (userid, siteid, eventdate); 
B = group views by (userid, siteid); 
C = foreach B { 
  C1 = order A by eventdate; 
  generate flatten(Stitch(C1, over(c1.userid, 'count'))); 
  -- how do I translate   rank() over (partition by userid, siteid order by eventdate)
   as rank
} 
D = ??? -- how do I project the fields 

I am unclear on how to use two multiple over clauses , and unclear on how to project out the final step

Upvotes: 1

Views: 1427

Answers (1)

alexeipab
alexeipab

Reputation: 3619

you could use Enumerate from DataFu, it adds index to each tuple in the bag

On top of my head your code might look something like this:

define Enumerate datafu.pig.bags.Enumerate('1');
A = load views as (userid, siteid, eventdate); 
B = group views by (userid, siteid); 
C = foreach B { 
  C1 = order A by eventdate; 
  generate FLATTEN(group) as (userid, siteid), 
           FLATTEN(Enumerate(C1.eventdate)) as (eventdate, rank),
           COUNT(C1) as count;
} 
dump C;

OR

define Enumerate datafu.pig.bags.Enumerate('1');
A = load views as (userid, siteid, eventdate); 
B = group views by (userid, siteid); 
C = foreach B { 
  C1 = order A by eventdate; 
  generate FLATTEN(Enumerate(C1)) as (userid, siteid, eventdate, rank),
           COUNT(C1) as count;
} 
dump C;

Upvotes: 2

Related Questions