HiveQL Query to Find a Delta Between Rows if a Condition matches

Question

I have some data in data lake:

Person |  Date    |  Time    |  Number of Friends  |  
Bob    |  02/01   | unix_ts1 |  5                  |
Kate   |  02/01   | unix_ts2 |  2                  |
Jill   |  02/01   | unix_ts3 |  3                  |
Bob    |  02/01   | unix_ts3 |  7                  |
Kate   |  02/02   | unix_ts4 |  10                 |
Jill   |  01/29   | unix_ts0 |  1                  |

I would like to produce a table like so:

Person |  Date    |  Time    |  Number of Friends DELTA  | Found Diff Between
Bob    |  02/01   | unix_ts1 |  NaN                      | (5, NaN)
Kate   |  02/01   | unix_ts2 |  NaN                      | (2, NaN)
Jill   |  02/01   | unix_ts3 |  2                        | (3, 1)
Bob    |  02/01   | unix_ts3 |  2                        | (7, 5)
Kate   |  02/02   | unix_ts4 |  8                        | (10, 2)

So, I have a table where each row is identified by a person's name and a time at which the data was recorded. I would like a query that will go and find instances of "Bob" and find deltas for consecutive timestamps and then give the delta, as well as the two values it found the diff between. I would like this to happen for each person.

I found a method to do this when there is just one value, using lag() command, but that would not do a match by Person. I also know how to do this in Pandas if I downloaded the data, but I am wondering if there is a way to do this in Hive.

Is there a way to do this? Thank you!

Vamsi Prabhala · Accepted Answer

Using lag window function.

select person
      ,date
      ,time
      ,num_friends-lag(num_friends) over(partition by person order by time) as delta
      ,concat_ws(',',num_friends,lag(num_friends) over(partition by person order by time)) as found_diff_between
from tbl

HiveQL Query to Find a Delta Between Rows if a Condition matches

Answers (1)

Related Questions