Reputation: 53
I've got a scenario where I have to take the results from a group by and max and create a new column:
For example, say I have this data:
|employee_name|department|state|salary|
+-------------+----------+-----+------+
| James| Sales| NY| 90000|
| Michael| Sales| NY| 86000|
| Robert| Sales| CA| 81000|
| Maria| Finance| CA| 90000|
| Raman| Finance| CA| 99000|
| Scott| Finance| NY| 83000|
| Jeff| Marketing| CA| 80000|
| Kumar| Marketing| NY| 91000|
My output should look like:
|employee_name|department|state|salary|max(salary by department)
+-------------+----------+-----+------+---
| James| Sales| NY| 90000| 90000
| Michael| Sales| NY| 86000| 90000
| Robert| Sales| CA| 81000| 90000
| Maria| Finance| CA| 85000| 88000
| Raman| Finance| CA| 88000| 88000
| Scott| Finance| NY| 83000| 88000
| Jeff| Marketing| CA| 80000| 91000
| Kumar| Marketing| NY| 91000| 91000
Any tips? Will be of great help.
Upvotes: 0
Views: 1344
Reputation: 631
You can also use partition instead of groupby.
df=df.withColumn('max_in_dept',F.max('salary')\
.over(Window.partitionBy('department')))
df.show(5,False)
Upvotes: 1
Reputation: 1
import pyspark.sql.functions as F
result = df.join(df.groupBy('department').agg(F.max('salary').alias('max_salary')).select('department','max_salary'),
'department')
Upvotes: 0