Reputation: 181
I have two DataFrames: one is a full load and the other is an incremental load. For every customer_ID that appears in both DataFrames, I want to add the amounts together, so the final output has one row per customer_ID with the updated value. I want to do this purely with DataFrames. Another approach would be an upsert, but that requires creating a few staging tables, which I want to avoid. How can I add the incremental data to the main table?
**Full Table**

| customer_ID | amount |
|---|---|
| ABC1234 | 20 |
| ABC1235 | 10 |
| ABC2222 | 20 |
| ABC1345 | 10 |
| ABC4444 | 5 |
| ABC5555 | 9 |
**incremental_load**

| customer_ID | amount |
|---|---|
| ABC1234 | 5 |
| ABC1235 | 5 |
| ABC2222 | 7 |
| ABC1345 | 20 |
| ABC9999 | 15 |
**Desired_output**

| customer_ID | amount |
|---|---|
| ABC1234 | 25 |
| ABC1235 | 15 |
| ABC2222 | 27 |
| ABC1345 | 30 |
| ABC4444 | 5 |
| ABC5555 | 9 |
| ABC9999 | 15 |
Upvotes: 1
Views: 58
Reputation: 14845
You can union both dataframes and then sum all entries for the same customer:

```python
from pyspark.sql import functions as F

dfFull = ...
dfInc = ...

dfResult = dfFull.union(dfInc) \
    .groupBy('customer_ID') \
    .agg(F.sum('amount').alias('amount'))
```

Customers that appear in only one of the two dataframes (like ABC4444 or ABC9999) are kept as well, since `union` simply stacks all rows and `groupBy` sums whatever rows exist for each key.
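The union-then-aggregate logic can be sanity-checked without a Spark session. A minimal plain-Python sketch using the sample data from the question (`collections.Counter` plays the role of the groupBy-sum):

```python
from collections import Counter

# Sample rows from the question: (customer_ID, amount)
full = [("ABC1234", 20), ("ABC1235", 10), ("ABC2222", 20),
        ("ABC1345", 10), ("ABC4444", 5), ("ABC5555", 9)]
incremental = [("ABC1234", 5), ("ABC1235", 5), ("ABC2222", 7),
               ("ABC1345", 20), ("ABC9999", 15)]

# "Union" both row sets, then sum amounts per customer_ID
totals = Counter()
for cid, amount in full + incremental:
    totals[cid] += amount

print(dict(totals))
# ABC1234 -> 25, ABC1235 -> 15, ABC2222 -> 27, ABC1345 -> 30,
# ABC4444 -> 5, ABC5555 -> 9, ABC9999 -> 15
```

This matches the desired output table: shared IDs get their amounts added, and IDs present on only one side pass through unchanged.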
Upvotes: 1