Creating new column in Hive based on previous value of another column

Question

I have data that looks like this:

Status Order
NULL   1
NULL   2
1      3
NULL   4
NULL   5
0      6
NULL   7
NULL   8
NULL   9
1      10

I am trying to add a new column that fills in the nulls with the previous number, such as:

Status     New
NULL       NULL
NULL       NULL
1          1
NULL       1
NULL       1
0          0
NULL       0
NULL       0
NULL       0
1          1

I am using hive, and do have access to pyspark, but was trying to solve through SQL if possible. I have looked into window functions (Rank, Dense_rank, Row_number, Lag, Lead). Any suggestions?

Gordon Linoff · Accepted Answer

What you need are equivalence groups. Then you can use window functions to fill in the data. A group can be defined as the number of non-null values up to an including a given value.

This assumes that you have a column that specifies the ordering.

So:

select t.*, max(status) over (partition by grp) as new_status
from (select t.*, count(status) over (order by ?) as grp
      from t
     ) t;

The ? is for the column that specifies the ordering.

Creating new column in Hive based on previous value of another column

Answers (1)

Related Questions