Redshift Row_Number() Query with partitions that restart

Question

I have data with id, timestamp(ts) event, capital_event_bool, and prev_event_capital_bool.

id   ts          event   capital_event_bool    prev_event_capital_bool
001   00:01       a          0                     0
002   00:02       b          0                     0
002   00:03       b          0                     0
002   00:04       b          1                     0
002   00:05       c          0                     1
003   00:03       c          0                     0
003   00:04       b          0                     0
003   00:05       b          1                     0
003   00:06       b          0                     1
003   00:07       b          0                     0
003   00:08       b          1                     0

Only "b" events can have a capital_event_bool = True. What I would like to accomplish is have a way to count all capital_event_bool = False b events prior to every capital_event_bool = True event for every id. I originally thought I could accomplish this via the row_number() window function in Redshift with

ROW_NUMBER() OVER (PARTITION BY id, event, capital_event_bool  ORDER BY ts) AS row_num

but the part that is tripping me up is how to get the count to restart after every capital_event_bool = True event. It is fine if the row numbering will stop at every capital_event_bool = True event and then restart because I can just use a case statement with the capital_event_bool to reach my final result.

row_num      DESIRED only row_num       Final Desired Result   
1              1                                0
1              1                                0
2              2                                0
1              3                                2
1              1                                0
2              2                                0
1              1                                0
1              2                                1
2              1                                0
3              2                                0 
2              3                                2

Gordon Linoff · Accepted Answer

This is a type of gap-and-islands problem. Basically, you need to define subsets of the data by the number of "1" in the "b" columns. For this purpose, an inverse sum of capital_event_bool does exactly what you want. Then, you can use window functions on this group:

select t.*,
       (case when capital_event_bool = 1
             then sum( (event = 'b')::int ) over (partition by id, grp) - 1
             else 0
        end) as final_result
from (select t.*,
             sum(capital_event_bool) over (partition by id order by ts desc) as grp
      from t
     ) t

Redshift Row_Number() Query with partitions that restart

Answers (1)

Related Questions