Reputation: 773
Again, on AWS with PySpark, I can't seem to get this right. My first dataframe, which is the result of a prior data transformation, looks like this (DF1):
| id | class |
| 2 | Bob |
| 4 | Darryl |
| 6 | Frank |
My second dataframe, which is the raw data, looks like this (DF2):
| id | amount |
| 1 | 5 |
| 5 | 3 |
| 2 | 3 |
| 4 | 3 |
| 3 | 2 |
| 4 | 2 |
| 2 | 5 |
| 4 | 1 |
| 6 | 0 |
I need to sum the amounts for each class (id) in DF1. My result (DF3) needs to look like this:
| id | amount |
| 2 | 8 |
| 4 | 6 |
| 6 | 0 |
My logic is to delete all rows in DF2 where DF2.id does not match any DF1.id, and then sum the amounts per id. In SQL, I would join DF1 and DF2 on DF1.id = DF2.id, keeping just DF1.id and DF2.amount, then do a GROUP BY on id and sum the amount. I have been thrashing, and at this point I'm almost ready to give up. Ideas, please? Thanks.
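To spell out what I mean in SQL terms, here is a rough Spark SQL sketch (assuming the dataframes are registered as temp views and spark is the active SparkSession):
df1.createOrReplaceTempView('df1')
df2.createOrReplaceTempView('df2')
# join on id so only matching rows survive, then aggregate per id
df3 = spark.sql('''
    SELECT d2.id, SUM(d2.amount) AS amount
    FROM df2 AS d2
    JOIN df1 AS d1 ON d1.id = d2.id
    GROUP BY d2.id
''')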
Upvotes: 0
Views: 345
Reputation: 42422
Just as you said, you can join on id, then group by id and sum up amount:
import pyspark.sql.functions as F

# the default join type is inner, so only ids present in both dataframes survive
df3 = df2.join(df1, 'id').groupBy('id').agg(F.sum('amount').alias('sum_amount'))
df3.show()
+---+----------+
| id|sum_amount|
+---+----------+
| 6| 0|
| 4| 6|
| 2| 8|
+---+----------+
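Note that groupBy does not guarantee output order, which is why the rows above come out as 6, 4, 2; add .orderBy('id') before show() if you need them sorted. To reproduce this end to end, a minimal sketch (assuming a local SparkSession; data and column names taken from the question):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# sample data from the question
df1 = spark.createDataFrame([(2, 'Bob'), (4, 'Darryl'), (6, 'Frank')], ['id', 'class'])
df2 = spark.createDataFrame(
    [(1, 5), (5, 3), (2, 3), (4, 3), (3, 2), (4, 2), (2, 5), (4, 1), (6, 0)],
    ['id', 'amount'])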
Upvotes: 1