Reputation: 181
I have a table that applies a row_number window function over some IDs. Every time new data arrives it is a full load, so row_number runs over the entire data set again, which is quite inefficient: it consumes a lot of resources and is CPU-intensive, and this table is rebuilt every 15 to 30 minutes. I am trying to achieve the same result incrementally, adding the row numbers of the incremental load to the last row_num of each customer_ID.
So when new records come in, I want to save the max row_num for that particular customer_ID, say max_row_num = 4. If two new records then arrive for that ID, the incremental row numbers are 1 and 2, and the final output should be 4+1 and 4+2, so the new row numbers become 1, 2, 3, 4, 5, 6.
I want to implement this logic in PySpark, but I am also open to a plain Python solution that I can convert to a PySpark DataFrame later.
Please suggest possible solutions; a rough Python sketch of the logic I am after follows the example tables below.
Full load -- initial table
Row_num | customer_ID |
---|---|
1 | ABC123 |
2 | ABC123 |
3 | ABC123 |
1 | ABC125 |
2 | ABC125 |
1 | ABC225 |
2 | ABC225 |
3 | ABC225 |
4 | ABC225 |
5 | ABC225 |
Incremental load
Row_num | customer_ID |
---|---|
1 | ABC123 |
2 | ABC123 |
1 | ABC125 |
1 | ABC225 |
2 | ABC225 |
1 | ABC330 |
DESIRED OUTPUT
Row_num | customer_ID |
---|---|
1 | ABC123 |
2 | ABC123 |
3 | ABC123 |
4 | ABC123 |
5 | ABC123 |
1 | ABC125 |
2 | ABC125 |
3 | ABC125 |
1 | ABC225 |
2 | ABC225 |
3 | ABC225 |
4 | ABC225 |
5 | ABC225 |
6 | ABC225 |
7 | ABC225 |
1 | ABC330 |
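To make the intent concrete, here is a rough plain-Python sketch of the logic I am after (variable names are just placeholders; I would convert this to PySpark later):

```python
# In-memory sketch: remember the max row_num per customer from the full
# load, then offset the incremental row numbers by that maximum.
full_load = [
    (1, "ABC123"), (2, "ABC123"), (3, "ABC123"),
    (1, "ABC125"), (2, "ABC125"),
    (1, "ABC225"), (2, "ABC225"), (3, "ABC225"), (4, "ABC225"), (5, "ABC225"),
]
incremental = [
    (1, "ABC123"), (2, "ABC123"), (1, "ABC125"),
    (1, "ABC225"), (2, "ABC225"), (1, "ABC330"),
]

# Max existing row_num per customer_id (unseen customers default to 0).
max_row_num = {}
for row_num, customer_id in full_load:
    max_row_num[customer_id] = max(max_row_num.get(customer_id, 0), row_num)

# Offset each incremental row_num and append to the full data.
result = full_load + [
    (row_num + max_row_num.get(customer_id, 0), customer_id)
    for row_num, customer_id in incremental
]

for row in sorted(result, key=lambda r: (r[1], r[0])):
    print(row)
```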
Upvotes: 2
Views: 678
Reputation: 1269973
If you are trying to insert the values with the new row number, you can join in the maximum existing row number:
insert into full (row_num, customer_id)
select i.row_num + coalesce(f.max_row_num, 0), i.customer_id
from incremental i left join
     (select f.customer_id, max(f.row_num) as max_row_num
      from full f
      group by f.customer_id
     ) f
     on i.customer_id = f.customer_id;
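Since the question mentions PySpark, a DataFrame sketch of the same approach might look like the following; `full_df` and `incremental_df` are assumed names for the full table and the incremental batch, each with columns `row_num` and `customer_id`:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Assumed inputs: the existing full table and the new incremental batch.
full_df = spark.createDataFrame(
    [(1, "ABC123"), (2, "ABC123"), (3, "ABC123"), (1, "ABC125"), (2, "ABC125")],
    ["row_num", "customer_id"],
)
incremental_df = spark.createDataFrame(
    [(1, "ABC123"), (1, "ABC125"), (1, "ABC330")],
    ["row_num", "customer_id"],
)

# Maximum existing row_num per customer (the max_row_num subquery above).
max_df = full_df.groupBy("customer_id").agg(F.max("row_num").alias("max_row_num"))

# Left join so brand-new customers fall back to an offset of 0,
# then shift the incremental row numbers and append to the full table.
shifted = (
    incremental_df
    .join(max_df, "customer_id", "left")
    .withColumn("row_num", F.col("row_num") + F.coalesce(F.col("max_row_num"), F.lit(0)))
    .select("row_num", "customer_id")
)

result = full_df.unionByName(shifted)
result.orderBy("customer_id", "row_num").show()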
Upvotes: 2