Reputation: 181
I have two DataFrames: one is a full load and the other is an incremental load. For every customer_ID that appears in both DataFrames, I want to add the amounts together, so the final output has one row per customer_ID with the updated value. I want to do this purely with DataFrames. Another approach would be an upsert, but that requires creating a few staging tables, which I want to avoid. How can I add the incremental data to the main table?
**Full Table**

| customer_ID | amount |
|---|---|
| ABC1234 | 20 |
| ABC1235 | 10 |
| ABC2222 | 20 |
| ABC1345 | 10 |
| ABC4444 | 5 |
| ABC5555 | 9 |
**incremental_load**

| customer_ID | amount |
|---|---|
| ABC1234 | 5 |
| ABC1235 | 5 |
| ABC2222 | 7 |
| ABC1345 | 20 |
| ABC9999 | 15 |
**Desired_output**

| customer_ID | amount |
|---|---|
| ABC1234 | 25 |
| ABC1235 | 15 |
| ABC2222 | 27 |
| ABC1345 | 30 |
| ABC4444 | 5 |
| ABC5555 | 9 |
| ABC9999 | 15 |
Upvotes: 1
Views: 58
Reputation: 14845
You can union both dataframes and then sum all entries for the same customer:

```python
from pyspark.sql import functions as F

dfFull = ...
dfInc = ...

dfResult = dfFull.union(dfInc) \
    .groupBy('customer_ID') \
    .agg(F.sum('amount').alias('amount'))
```

Customers that appear in only one of the two dataframes (like ABC4444 or ABC9999) are kept as well, since `union` simply stacks all rows and `groupBy` sums whatever rows exist for each key.
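The union-then-aggregate logic can be sanity-checked without a Spark session. A minimal plain-Python sketch using the sample data from the question (`collections.Counter` plays the role of the groupBy-sum):

```python
from collections import Counter

# Sample rows from the question: (customer_ID, amount)
full = [("ABC1234", 20), ("ABC1235", 10), ("ABC2222", 20),
        ("ABC1345", 10), ("ABC4444", 5), ("ABC5555", 9)]
incremental = [("ABC1234", 5), ("ABC1235", 5), ("ABC2222", 7),
               ("ABC1345", 20), ("ABC9999", 15)]

# "Union" both row sets, then sum amounts per customer_ID
totals = Counter()
for cid, amount in full + incremental:
    totals[cid] += amount

print(dict(totals))
# ABC1234 -> 25, ABC1235 -> 15, ABC2222 -> 27, ABC1345 -> 30,
# ABC4444 -> 5, ABC5555 -> 9, ABC9999 -> 15
```

This matches the desired output table: shared IDs get their amounts added, and IDs present on only one side pass through unchanged.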
Upvotes: 1