sanju

Reputation: 49

PySpark: Groupby within groups and display sum in separate fields based on certain values

I have a dataframe that contains Company name, EmpId, Bonus, and Salary:

COMPANY  EMPID  BONUS  SALARY
APPLE    1234   No     5
APPLE    1235   No     7
GOOGLE   6786   Yes    6
GOOGLE   6787   No     5
GOOGLE   6788   No     6
TARGET   9091   Yes    4
TARGET   9092   Yes    1
TARGET   9093   Yes    9
TARGET   9094   No     2
TARGET   9094   Yes    4

I want to group by company name, regardless of EmpId, and create two columns: the total salary of employees who received no bonus (TOT_AMT_NO_BONUS) and the total salary of employees who received a bonus (TOTAL_BONUS).

Expected output:

COMPANY  TOT_AMT_NO_BONUS  TOTAL_BONUS
APPLE    12                0
GOOGLE   11                6
TARGET   2                 18

I tried groupBy, aggregate, and sum, but I'm not getting the expected results. I'd appreciate it if someone could help. Thanks in advance.
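
For reference, a minimal sketch that builds the sample dataframe above (the SparkSession boilerplate and the variable name sparkDF, chosen to match the answer below, are assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the table above; sparkDF is an assumed variable name
sparkDF = spark.createDataFrame(
    [('APPLE', 1234, 'No', 5), ('APPLE', 1235, 'No', 7),
     ('GOOGLE', 6786, 'Yes', 6), ('GOOGLE', 6787, 'No', 5),
     ('GOOGLE', 6788, 'No', 6), ('TARGET', 9091, 'Yes', 4),
     ('TARGET', 9092, 'Yes', 1), ('TARGET', 9093, 'Yes', 9),
     ('TARGET', 9094, 'No', 2), ('TARGET', 9094, 'Yes', 4)],
    ['COMPANY', 'EMPID', 'BONUS', 'SALARY'],
)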

Upvotes: 2

Views: 44

Answers (1)

Derek O

Reputation: 19545

You can group by COMPANY, pivot on BONUS, and sum SALARY, then fill any null entries with 0:

result = (
    sparkDF.groupBy('COMPANY')       # one row per company
    .pivot('BONUS')                  # the 'No'/'Yes' values become columns
    .sum('SALARY')                   # total salary within each company/bonus group
    .withColumnRenamed('No', 'TOT_AMT_NO_BONUS')
    .withColumnRenamed('Yes', 'TOTAL_BONUS')
    .fillna(value=0, subset=['TOT_AMT_NO_BONUS', 'TOTAL_BONUS'])  # e.g. APPLE has no 'Yes' rows
)
result.show()

+-------+----------------+-----------+
|COMPANY|TOT_AMT_NO_BONUS|TOTAL_BONUS|
+-------+----------------+-----------+
|  APPLE|              12|          0|
| TARGET|               2|         18|
| GOOGLE|              11|          6|
+-------+----------------+-----------+
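
Equivalently, you could skip the pivot and build each column with a conditional aggregation; a minimal sketch under the same assumptions (sparkDF as above):

from pyspark.sql import functions as F

result = sparkDF.groupBy('COMPANY').agg(
    # sum SALARY only over rows where BONUS == 'No'; non-matching rows contribute 0
    F.sum(F.when(F.col('BONUS') == 'No', F.col('SALARY')).otherwise(0)).alias('TOT_AMT_NO_BONUS'),
    F.sum(F.when(F.col('BONUS') == 'Yes', F.col('SALARY')).otherwise(0)).alias('TOTAL_BONUS'),
)

This produces the same result without the rename and fillna steps, since the column names and zero defaults are set directly in the aggregation.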

Upvotes: 1
