Kunal Ghosh
Kunal Ghosh

Reputation: 1

Computing lag in Hive by a variable

My input table looks like:

guest_id    days
101          79
101          70
101          68
101          61
102          101
102          90
102          55
103          99
103          90

Note that, days are in descending order,by guest_id

Desired output table:

guest_id    days     days_diff
101          79      0
101          70      9
101          68      2
101          61      7
102          101     0
102          90      11
102          55      35
103          99      0
103          90      9

days_diff is the first order difference by guest_id (not throughout days column)

Upvotes: 0

Views: 1073

Answers (1)

maxymoo
maxymoo

Reputation: 36545

You need to have a unique id column as well (otherwise Hive doesn't know about the order of your rows).

Then you can just self join on id=id+1 to get your differences:

select a.guest_id, 
       a.days, 
       case when a.guest_id = b.guest_id then b.days-a.days else 0 end days_diff
from 
input a
join input b on a.id=b.id-1

Edit: As pointed out by Kunal in the comments, Hive does have a Lag window function which requires a PARTITION BY ... ORDER BY clause; you still need something to order your table by, for example if you have a date column you would used this like the following:

SELECT guest_id, 
       days, 
       LAG(days, 1, 0) OVER (PARTITION BY guest_id ORDER BY date)
FROM input;

Upvotes: 1

Related Questions