Country

Reputation: 41

Get the max value for each group of a group by in PySpark

I have a PySpark dataframe with these columns and values:

+--------+----------+------+
|employee|department|salary|
+--------+----------+------+
|   James|     Sales| 90000|
| Michael|     Sales| 86000|
|  Robert|     Sales| 81000|
|   Maria|   Finance| 90000|
|   Raman|   Finance| 99000|
|   Scott|   Finance| 83000|
|    Jeff| Marketing| 80000|
|   Kumar| Marketing| 91000|
|   Sarah| Marketing| 95000|
+--------+----------+------+
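
For reference, this sample dataframe can be rebuilt with something like the following sketch (assuming an active SparkSession bound to the name spark):

data = [
    ("James", "Sales", 90000),
    ("Michael", "Sales", 86000),
    ("Robert", "Sales", 81000),
    ("Maria", "Finance", 90000),
    ("Raman", "Finance", 99000),
    ("Scott", "Finance", 83000),
    ("Jeff", "Marketing", 80000),
    ("Kumar", "Marketing", 91000),
    ("Sarah", "Marketing", 95000),
]
# Column names match the table above
df = spark.createDataFrame(data, ["employee", "department", "salary"])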

I want to get a dataframe with the employee who has the highest salary in each department.

So my result would be:

+--------+----------+--------------+
|employee|department|highest salary|
+--------+----------+--------------+
|   James|     Sales|         90000|
|   Raman|   Finance|         99000|
|   Sarah| Marketing|         95000|
+--------+----------+--------------+

I have already tried multiple things with groupBy, but I don't know how to get this result.

Thank you for your help.

Upvotes: 1

Views: 2213

Answers (2)

blackbishop

Reputation: 32650

You can use groupBy with max on a struct column to get the highest salary by department along with the associated employee. Spark compares structs field by field in declaration order, so listing salary first makes max pick the row with the greatest salary:

import pyspark.sql.functions as F

result = df.groupBy("department") \
    .agg(F.max(F.struct("salary", "employee")).alias("max")) \
    .selectExpr("max.employee", "department", "max.salary as highest_salary")

result.show()
#+--------+----------+--------------+
#|employee|department|highest_salary|
#+--------+----------+--------------+
#|   Raman|   Finance|         99000|
#|   Sarah| Marketing|         95000|
#|   James|     Sales|         90000|
#+--------+----------+--------------+

Upvotes: 2

过过招

Reputation: 4189

You can use the analytic functions rank or row_number. The difference matters only on ties: if two employees in a department share the top salary, rank keeps both rows, while row_number keeps exactly one.

import pyspark.sql.functions as F

df = df.withColumn('rank', F.expr('rank() over (partition by department order by salary desc)')) \
    .filter('rank = 1').drop('rank')
df.show(truncate=False)
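
A minimal sketch of the row_number variant, using the DataFrame Window API instead of a SQL expression:

from pyspark.sql import Window
import pyspark.sql.functions as F

# Order each department's rows by salary, highest first
w = Window.partitionBy('department').orderBy(F.col('salary').desc())

result = df.withColumn('rn', F.row_number().over(w)) \
    .filter('rn = 1') \
    .drop('rn')
result.show(truncate=False)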

Upvotes: 1
