sanju

Reputation: 49

PySpark: Groupby within groups and display sum in separate fields based on certain values

I have a dataframe that contains Company name, EmpId, Bonus, and Salary:

COMPANY  EMPID  BONUS  SALARY
APPLE    1234   No     5
APPLE    1235   No     7
GOOGLE   6786   Yes    6
GOOGLE   6787   No     5
GOOGLE   6788   No     6
TARGET   9091   Yes    4
TARGET   9092   Yes    1
TARGET   9093   Yes    9
TARGET   9094   No     2
TARGET   9094   Yes    4

I want to group by company name, regardless of EmpId, and create two columns: the total salary of employees who received no bonus (TOT_AMT_NO_BONUS) and the total salary of employees who received a bonus (TOTAL_BONUS).

Expected output:

COMPANY  TOT_AMT_NO_BONUS  TOTAL_BONUS
APPLE    12                0
GOOGLE   11                6
TARGET   2                 18

I tried groupBy, aggregate, and sum, but I'm not getting the expected results. I'd appreciate it if someone could help. Thanks in advance.
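
For reference, a minimal sketch that builds the sample dataframe above (the SparkSession boilerplate and the variable name sparkDF, chosen to match the answer below, are assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the table above; sparkDF is an assumed variable name
sparkDF = spark.createDataFrame(
    [('APPLE', 1234, 'No', 5), ('APPLE', 1235, 'No', 7),
     ('GOOGLE', 6786, 'Yes', 6), ('GOOGLE', 6787, 'No', 5),
     ('GOOGLE', 6788, 'No', 6), ('TARGET', 9091, 'Yes', 4),
     ('TARGET', 9092, 'Yes', 1), ('TARGET', 9093, 'Yes', 9),
     ('TARGET', 9094, 'No', 2), ('TARGET', 9094, 'Yes', 4)],
    ['COMPANY', 'EMPID', 'BONUS', 'SALARY'],
)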

Upvotes: 2

Views: 44

Answers (1)

Derek O

Reputation: 19545

You can group by COMPANY, pivot on BONUS, and sum SALARY, then fill any null entries with 0:

result = (
    sparkDF.groupBy('COMPANY')       # one row per company
    .pivot('BONUS')                  # the 'No'/'Yes' values become columns
    .sum('SALARY')                   # total salary within each company/bonus group
    .withColumnRenamed('No', 'TOT_AMT_NO_BONUS')
    .withColumnRenamed('Yes', 'TOTAL_BONUS')
    .fillna(value=0, subset=['TOT_AMT_NO_BONUS', 'TOTAL_BONUS'])  # e.g. APPLE has no 'Yes' rows
)
result.show()

+-------+----------------+-----------+
|COMPANY|TOT_AMT_NO_BONUS|TOTAL_BONUS|
+-------+----------------+-----------+
|  APPLE|              12|          0|
| TARGET|               2|         18|
| GOOGLE|              11|          6|
+-------+----------------+-----------+
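
Equivalently, you could skip the pivot and build each column with a conditional aggregation; a minimal sketch under the same assumptions (sparkDF as above):

from pyspark.sql import functions as F

result = sparkDF.groupBy('COMPANY').agg(
    # sum SALARY only over rows where BONUS == 'No'; non-matching rows contribute 0
    F.sum(F.when(F.col('BONUS') == 'No', F.col('SALARY')).otherwise(0)).alias('TOT_AMT_NO_BONUS'),
    F.sum(F.when(F.col('BONUS') == 'Yes', F.col('SALARY')).otherwise(0)).alias('TOTAL_BONUS'),
)

This produces the same result without the rename and fillna steps, since the column names and zero defaults are set directly in the aggregation.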

Upvotes: 1
