Reputation: 55
Is it possible to return the "fav_place" column in pyspark without using a join (see example below)?
+-----+-------+------+          +-----+-------+------+-----------+----------+
|  day|  place|number|          |  day|  place|number| max_number| fav_place|
+-----+-------+------+          +-----+-------+------+-----------+----------+
|  Mon|Place A|    10|          |  Mon|Place A|    10|         42|   Place B|
|  Mon|Place B|    42|  ===>>   |  Mon|Place B|    42|         42|   Place B|
| Tues|Place C|    47|          | Tues|Place C|    47|         47|   Place C|
| Tues|Place D|    41|          | Tues|Place D|    41|         47|   Place C|
|Thurs|Place D|    45|          |Thurs|Place D|    45|         45|   Place D|
|  Fri|Place E|    64|          |  Fri|Place E|    64|         64|   Place E|
|  Fri|Place A|    12|          |  Fri|Place A|    12|         64|   Place E|
|  Wed|Place F|    54|          |  Wed|Place F|    54|         54|   Place F|
|  Wed|Place A|     7|          |  Wed|Place A|     7|         54|   Place F|
|  Wed|Place C|     1|          |  Wed|Place C|     1|         54|   Place F|
+-----+-------+------+          +-----+-------+------+-----------+----------+
The max_number column should be relatively straightforward to calculate with...
from pyspark.sql import Window
from pyspark.sql.functions import max
eg_window = Window.partitionBy(['day'])
df.withColumn('max_number', max('number').over(eg_window))
...but I can't figure out how to get the corresponding place without having to do a join.
Is it possible? Or is it better to just use a groupBy + join (sketched below)?
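For reference, the groupBy + join version I'd otherwise fall back on looks roughly like this (just a sketch; ties for the max would produce duplicate rows):
from pyspark.sql.functions import col, max as max_

# find each day's max, then join back to look up which place produced it
daily_max = df.groupBy('day').agg(max_('number').alias('max_number'))
fav = (df.join(daily_max, on='day')
         .where(col('number') == col('max_number'))
         .select('day', 'max_number', col('place').alias('fav_place')))
result = df.join(fav, on='day')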
Upvotes: 0
Views: 29
Reputation: 9363
Is it possible?
Yes.
from pyspark.sql import Window
from pyspark.sql.functions import desc, first
# ordering by number descending means first() in each day's partition picks
# the row with the max number, so its place is the fav_place - no join needed
eg_window = Window.partitionBy(['day']).orderBy(desc('number'))
df.select('*',
          first('number').over(eg_window).alias('max_number'),
          first('place').over(eg_window).alias('fav_place'))
Or is it better to just use a group by + join?
With the sample data the window function was faster, though that may just be down to how I wrote the join. I'll leave the actual benchmarking on your data to you.
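If you want to compare on your own data, a rough (not rigorous) check is to time an action that forces each plan to run; window_result and join_result below are placeholders for whatever DataFrames the two approaches produce:
import time

def time_it(result_df, label):
    # count() forces the whole plan to execute
    start = time.time()
    result_df.count()
    print(label, round(time.time() - start, 3))

time_it(window_result, 'window')  # DataFrame built with the window approach
time_it(join_result, 'join')      # DataFrame built with groupBy + join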
Upvotes: 1