Reputation: 773
Again, on AWS with PySpark, I can't seem to get this right. My first dataframe, which is the result of a prior data transformation, looks like this (DF1):
| id | class |
| 2 | Bob |
| 4 | Darryl |
| 6 | Frank |
My second dataframe, which is the raw data, looks like this (DF2):
| id | amount |
| 1 | 5 |
| 5 | 3 |
| 2 | 3 |
| 4 | 3 |
| 3 | 2 |
| 4 | 2 |
| 2 | 5 |
| 4 | 1 |
| 6 | 0 |
I need to sum the amounts for each class (id) in DF1. My result (DF3) needs to look like this:
| id | amount |
| 2 | 8 |
| 4 | 6 |
| 6 | 0 |
My logic is to delete all rows in DF2 where DF2.id does not match any DF1.id, and then sum the amounts per id. In SQL, I would join DF1 and DF2 on DF1.id = DF2.id, keeping just DF1.id and DF2.amount, then do a GROUP BY on id and sum the amount. I have been thrashing, and at this point I'm almost ready to give up. Ideas, please? Thanks.
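To spell out what I mean in SQL terms, here is a rough Spark SQL sketch (assuming the dataframes are registered as temp views and spark is the active SparkSession):
df1.createOrReplaceTempView('df1')
df2.createOrReplaceTempView('df2')
# join on id so only matching rows survive, then aggregate per id
df3 = spark.sql('''
    SELECT d2.id, SUM(d2.amount) AS amount
    FROM df2 AS d2
    JOIN df1 AS d1 ON d1.id = d2.id
    GROUP BY d2.id
''')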
Upvotes: 0
Views: 345
Reputation: 42422
Just as you said, you can join on id, then group by id and sum up amount:
import pyspark.sql.functions as F

# the default join type is inner, so only ids present in both dataframes survive
df3 = df2.join(df1, 'id').groupBy('id').agg(F.sum('amount').alias('sum_amount'))
df3.show()
+---+----------+
| id|sum_amount|
+---+----------+
| 6| 0|
| 4| 6|
| 2| 8|
+---+----------+
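Note that groupBy does not guarantee output order, which is why the rows above come out as 6, 4, 2; add .orderBy('id') before show() if you need them sorted. To reproduce this end to end, a minimal sketch (assuming a local SparkSession; data and column names taken from the question):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# sample data from the question
df1 = spark.createDataFrame([(2, 'Bob'), (4, 'Darryl'), (6, 'Frank')], ['id', 'class'])
df2 = spark.createDataFrame(
    [(1, 5), (5, 3), (2, 3), (4, 3), (3, 2), (4, 2), (2, 5), (4, 1), (6, 0)],
    ['id', 'amount'])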
Upvotes: 1