Sky Monster

Reputation: 53

PySpark - adding a new column with values from a group by and max

I've got a scenario where I need to take the result of a group by and max and add it to each row as a new column:

For example, say I have this data:

+-------------+----------+-----+------+
|employee_name|department|state|salary|
+-------------+----------+-----+------+
|        James|     Sales|   NY| 90000|
|      Michael|     Sales|   NY| 86000|
|       Robert|     Sales|   CA| 81000|
|        Maria|   Finance|   CA| 90000|
|        Raman|   Finance|   CA| 99000|
|        Scott|   Finance|   NY| 83000|
|         Jeff| Marketing|   CA| 80000|
|        Kumar| Marketing|   NY| 91000|
+-------------+----------+-----+------+

My output should look like:

+-------------+----------+-----+------+-------------------------+
|employee_name|department|state|salary|max(salary) by department|
+-------------+----------+-----+------+-------------------------+
|        James|     Sales|   NY| 90000|                    90000|
|      Michael|     Sales|   NY| 86000|                    90000|
|       Robert|     Sales|   CA| 81000|                    90000|
|        Maria|   Finance|   CA| 90000|                    99000|
|        Raman|   Finance|   CA| 99000|                    99000|
|        Scott|   Finance|   NY| 83000|                    99000|
|         Jeff| Marketing|   CA| 80000|                    91000|
|        Kumar| Marketing|   NY| 91000|                    91000|
+-------------+----------+-----+------+-------------------------+

Any tips would be of great help.
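
For reference, a minimal sketch that builds the sample DataFrame above (it assumes an active SparkSession named spark):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Sample data matching the table above
    data = [
        ("James", "Sales", "NY", 90000),
        ("Michael", "Sales", "NY", 86000),
        ("Robert", "Sales", "CA", 81000),
        ("Maria", "Finance", "CA", 90000),
        ("Raman", "Finance", "CA", 99000),
        ("Scott", "Finance", "NY", 83000),
        ("Jeff", "Marketing", "CA", 80000),
        ("Kumar", "Marketing", "NY", 91000),
    ]
    df = spark.createDataFrame(data, ["employee_name", "department", "state", "salary"])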

Upvotes: 0

Views: 1344

Answers (2)

Nidhi

Reputation: 631

You can also use a window partitioned by department instead of a group by:

    from pyspark.sql import Window, functions as F

    # Max salary within each department, attached to every row
    df = df.withColumn('max_in_dept',
                       F.max('salary').over(Window.partitionBy('department')))
    df.show(5, False)
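
Because the window only partitions the rows and never collapses them, every employee keeps their original row and simply gains the department-wide maximum as a new column.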

Upvotes: 1

BeyondCode

Reputation: 1

You can compute the group-by max as a separate DataFrame and join it back on department:

    import pyspark.sql.functions as F

    # Max salary per department, then joined back onto the original rows
    max_by_dept = df.groupBy('department').agg(F.max('salary').alias('max_salary'))
    result = df.join(max_by_dept, 'department')
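
Both approaches produce the same values: the join version builds a small per-department aggregate first and joins it back, while the window version in the other answer computes the max in place without a join.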

Upvotes: 0
