Reputation: 41
I have a PySpark dataframe with these columns and values:
+--------+----------+------+
|employee|department|salary|
+--------+----------+------+
|   James|     Sales| 90000|
| Michael|     Sales| 86000|
|  Robert|     Sales| 81000|
|   Maria|   Finance| 90000|
|   Raman|   Finance| 99000|
|   Scott|   Finance| 83000|
|    Jeff| Marketing| 80000|
|   Kumar| Marketing| 91000|
|   Sarah| Marketing| 95000|
+--------+----------+------+
I want to have a dataframe with the employee with the highest salary in each department.
So my result will be:
+--------+----------+--------------+
|employee|department|highest salary|
+--------+----------+--------------+
|   James|     Sales|         90000|
|   Raman|   Finance|         99000|
|   Sarah| Marketing|         95000|
+--------+----------+--------------+
I have already tried multiple things with groupBy, but I don't know what to do. Thank you for your help.
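For reference, a minimal sketch that builds this sample dataframe (assuming an existing SparkSession named spark):
data = [
    ("James", "Sales", 90000), ("Michael", "Sales", 86000),
    ("Robert", "Sales", 81000), ("Maria", "Finance", 90000),
    ("Raman", "Finance", 99000), ("Scott", "Finance", 83000),
    ("Jeff", "Marketing", 80000), ("Kumar", "Marketing", 91000),
    ("Sarah", "Marketing", 95000),
]
df = spark.createDataFrame(data, ["employee", "department", "salary"])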
Upvotes: 1
Views: 2213
Reputation: 32650
You can use groupBy and max on a struct column to get the highest salary by department, along with the associated employee, like this:
import pyspark.sql.functions as F

# max of a struct compares its fields left to right, so putting salary
# first makes the struct (salary, employee) order by salary
result = df.groupBy("department") \
    .agg(F.max(F.struct("salary", "employee")).alias("max")) \
    .selectExpr("max.employee", "department", "max.salary as highest_salary")
result.show()
#+--------+----------+--------------+
#|employee|department|highest_salary|
#+--------+----------+--------------+
#|   Raman|   Finance|         99000|
#|   Sarah| Marketing|         95000|
#|   James|     Sales|         90000|
#+--------+----------+--------------+
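On Spark 3.3+, the same result can also be expressed with the max_by aggregate (a minimal sketch, assuming that version is available):
import pyspark.sql.functions as F

# max_by(employee, salary) returns the employee associated with the
# maximum salary in each department (requires Spark >= 3.3)
result = df.groupBy("department") \
    .agg(F.max_by("employee", "salary").alias("employee"),
         F.max("salary").alias("highest_salary")) \
    .select("employee", "department", "highest_salary")
result.show()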
Upvotes: 2
Reputation: 4189
You can use the analytic functions rank or row_number.
import pyspark.sql.functions as F

# keep only the top-ranked (highest-salary) rows per department
df = df.withColumn('rank', F.expr('rank() over (partition by department order by salary desc)')) \
    .filter('rank = 1').drop('rank')
df.show(truncate=False)
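Note that rank keeps all ties for the top salary, while row_number would return exactly one row per department. The same logic can also be written with the DataFrame Window API (a minimal sketch):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# rank employees within each department by descending salary
w = Window.partitionBy('department').orderBy(F.desc('salary'))

df = df.withColumn('rank', F.rank().over(w)) \
    .filter('rank = 1').drop('rank')
df.show(truncate=False)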
Upvotes: 1