RebeccaK375
RebeccaK375

Reputation: 901

HiveQL Query to Find a Delta Between Rows if a Condition matches

I have some data in data lake:

Person |  Date    |  Time    |  Number of Friends  |  
Bob    |  02/01   | unix_ts1 |  5                  |
Kate   |  02/01   | unix_ts2 |  2                  |
Jill   |  02/01   | unix_ts3 |  3                  |
Bob    |  02/01   | unix_ts3 |  7                  |
Kate   |  02/02   | unix_ts4 |  10                 |
Jill   |  01/29   | unix_ts0 |  1                  |

I would like to produce a table like so:

Person |  Date    |  Time    |  Number of Friends DELTA  | Found Diff Between
Bob    |  02/01   | unix_ts1 |  NaN                      | (5, NaN)
Kate   |  02/01   | unix_ts2 |  NaN                      | (2, NaN)
Jill   |  02/01   | unix_ts3 |  2                        | (3, 1)
Bob    |  02/01   | unix_ts3 |  2                        | (7, 5)
Kate   |  02/02   | unix_ts4 |  8                        | (10, 2)

So, I have a table where each row is identified by a person's name and a time at which the data was recorded. I would like a query that will go and find instances of "Bob" and find deltas for consecutive timestamps and then give the delta, as well as the two values it found the diff between. I would like this to happen for each person.

I found a method to do this when there is just one value, using lag() command, but that would not do a match by Person. I also know how to do this in Pandas if I downloaded the data, but I am wondering if there is a way to do this in Hive.

Is there a way to do this? Thank you!

Upvotes: 0

Views: 271

Answers (1)

Vamsi Prabhala
Vamsi Prabhala

Reputation: 49260

Using lag window function.

select person
      ,date
      ,time
      ,num_friends-lag(num_friends) over(partition by person order by time) as delta
      ,concat_ws(',',num_friends,lag(num_friends) over(partition by person order by time)) as found_diff_between
from tbl

Upvotes: 1

Related Questions