Reputation: 49
I have a dataframe which contains Company name, EmpId, Bonus & Salary.
COMPANY | EMPID | BONUS | SALARY |
---|---|---|---|
APPLE | 1234 | No | 5 |
APPLE | 1235 | No | 7 |
6786 | Yes | 6 | |
6787 | No | 5 | |
6788 | No | 6 | |
TARGET | 9091 | Yes | 4 |
TARGET | 9092 | Yes | 1 |
TARGET | 9093 | Yes | 9 |
TARGET | 9094 | No | 2 |
TARGET | 9094 | Yes | 4 |
I want to group by Company name regardless of EmpIds and create 2 different columns which will display total sum of Bonus & salary received by company.
Expected output:
COMPANY | TOT_AMT_NO_BONUS | TOTAL_BONUS |
---|---|---|
APPLE | 12 | 0 |
11 | 6 | |
TARGET | 2 | 18 |
Tried with groupBy, Aggregate & Sum but not getting results as expected. Appreciate if someone can help. Thanks in advance.
Upvotes: 2
Views: 44
Reputation: 19545
You can groupby company, pivot on bonus, and sum salary, then fill any null entries with 0:
sparkDF.groupby('COMPANY').pivot('BONUS').sum('salary').withColumnRenamed(
'No', 'TOT_AMT_NO_BONUS',
).withColumnRenamed(
'Yes', 'TOTAL_BONUS'
).fillna(
value=0, subset=['TOT_AMT_NO_BONUS','TOTAL_BONUS']
)
+-------+----------------+-----------+
|COMPANY|TOT_AMT_NO_BONUS|TOTAL_BONUS|
+-------+----------------+-----------+
| APPLE| 12| 0|
| TARGET| 2| 18|
| GOOGLE| 11| 6|
+-------+----------------+-----------+
Upvotes: 1