Reputation: 181
I have a table that applies a row_number window function over some IDs. Every time new data arrives it is a full load, so row_number runs over the entire data set again, which is quite inefficient: it consumes a lot of resources and is CPU-intensive, and this table is rebuilt every 15 to 30 minutes. I am trying to achieve the same result incrementally, adding the row numbers of the incremental load to the last row_num of each customer_ID.
So when new records come in, I want to save the max row_num for that particular customer_ID, say max_row_num = 4. If two new records then arrive for that ID, the incremental row numbers are 1 and 2, and the final output should be 4+1 and 4+2, so the new row numbers become 1, 2, 3, 4, 5, 6.
I want to implement this logic in PySpark, but I am also open to a plain Python solution that I can convert to a PySpark DataFrame later.
Please suggest possible solutions; a rough Python sketch of the logic I am after follows the example tables below.
Full load -- initial table
Row_num | customer_ID |
---|---|
1 | ABC123 |
2 | ABC123 |
3 | ABC123 |
1 | ABC125 |
2 | ABC125 |
1 | ABC225 |
2 | ABC225 |
3 | ABC225 |
4 | ABC225 |
5 | ABC225 |
Incremental load
Row_num | customer_ID |
---|---|
1 | ABC123 |
2 | ABC123 |
1 | ABC125 |
1 | ABC225 |
2 | ABC225 |
1 | ABC330 |
DESIRED OUTPUT
Row_num | customer_ID |
---|---|
1 | ABC123 |
2 | ABC123 |
3 | ABC123 |
4 | ABC123 |
5 | ABC123 |
1 | ABC125 |
2 | ABC125 |
3 | ABC125 |
1 | ABC225 |
2 | ABC225 |
3 | ABC225 |
4 | ABC225 |
5 | ABC225 |
6 | ABC225 |
7 | ABC225 |
1 | ABC330 |
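To make the intent concrete, here is a rough plain-Python sketch of the logic I am after (variable names are just placeholders; I would convert this to PySpark later):

```python
# In-memory sketch: remember the max row_num per customer from the full
# load, then offset the incremental row numbers by that maximum.
full_load = [
    (1, "ABC123"), (2, "ABC123"), (3, "ABC123"),
    (1, "ABC125"), (2, "ABC125"),
    (1, "ABC225"), (2, "ABC225"), (3, "ABC225"), (4, "ABC225"), (5, "ABC225"),
]
incremental = [
    (1, "ABC123"), (2, "ABC123"), (1, "ABC125"),
    (1, "ABC225"), (2, "ABC225"), (1, "ABC330"),
]

# Max existing row_num per customer_id (unseen customers default to 0).
max_row_num = {}
for row_num, customer_id in full_load:
    max_row_num[customer_id] = max(max_row_num.get(customer_id, 0), row_num)

# Offset each incremental row_num and append to the full data.
result = full_load + [
    (row_num + max_row_num.get(customer_id, 0), customer_id)
    for row_num, customer_id in incremental
]

for row in sorted(result, key=lambda r: (r[1], r[0])):
    print(row)
```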
Upvotes: 2
Views: 678
Reputation: 1269973
If you are trying to insert the values with the new row number, you can join in the maximum existing row number:
insert into full (row_num, customer_id)
select i.row_num + coalesce(f.max_row_num, 0), i.customer_id
from incremental i left join
     (select f.customer_id, max(f.row_num) as max_row_num
      from full f
      group by f.customer_id
     ) f
     on i.customer_id = f.customer_id;
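Since the question mentions PySpark, a DataFrame sketch of the same approach might look like the following; `full_df` and `incremental_df` are assumed names for the full table and the incremental batch, each with columns `row_num` and `customer_id`:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Assumed inputs: the existing full table and the new incremental batch.
full_df = spark.createDataFrame(
    [(1, "ABC123"), (2, "ABC123"), (3, "ABC123"), (1, "ABC125"), (2, "ABC125")],
    ["row_num", "customer_id"],
)
incremental_df = spark.createDataFrame(
    [(1, "ABC123"), (1, "ABC125"), (1, "ABC330")],
    ["row_num", "customer_id"],
)

# Maximum existing row_num per customer (the max_row_num subquery above).
max_df = full_df.groupBy("customer_id").agg(F.max("row_num").alias("max_row_num"))

# Left join so brand-new customers fall back to an offset of 0,
# then shift the incremental row numbers and append to the full table.
shifted = (
    incremental_df
    .join(max_df, "customer_id", "left")
    .withColumn("row_num", F.col("row_num") + F.coalesce(F.col("max_row_num"), F.lit(0)))
    .select("row_num", "customer_id")
)

result = full_df.unionByName(shifted)
result.orderBy("customer_id", "row_num").show()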
Upvotes: 2