Reputation: 55
Is it possible to return the "fav_place" column in pyspark without using a join (see example below)?
+-----+-------+------+          +-----+-------+------+-----------+----------+
|  day|  place|number|          |  day|  place|number| max_number| fav_place|
+-----+-------+------+          +-----+-------+------+-----------+----------+
|  Mon|Place A|    10|          |  Mon|Place A|    10|         42|   Place B|
|  Mon|Place B|    42|  ===>>   |  Mon|Place B|    42|         42|   Place B|
| Tues|Place C|    47|          | Tues|Place C|    47|         47|   Place C|
| Tues|Place D|    41|          | Tues|Place D|    41|         47|   Place C|
|Thurs|Place D|    45|          |Thurs|Place D|    45|         45|   Place D|
|  Fri|Place E|    64|          |  Fri|Place E|    64|         64|   Place E|
|  Fri|Place A|    12|          |  Fri|Place A|    12|         64|   Place E|
|  Wed|Place F|    54|          |  Wed|Place F|    54|         54|   Place F|
|  Wed|Place A|     7|          |  Wed|Place A|     7|         54|   Place F|
|  Wed|Place C|     1|          |  Wed|Place C|     1|         54|   Place F|
+-----+-------+------+          +-----+-------+------+-----------+----------+
The max_number column should be relatively straightforward to calculate with...
from pyspark.sql import Window
from pyspark.sql.functions import max
eg_window = Window.partitionBy(['day'])
df.withColumn('max_number', max('number').over(eg_window))
...but I can't figure out how to get the corresponding place without having to do a join.
Is it possible? Or is it better to just use a groupBy + join (sketched below)?
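For reference, the groupBy + join version I'd otherwise fall back on looks roughly like this (just a sketch; ties for the max would produce duplicate rows):
from pyspark.sql.functions import col, max as max_

# find each day's max, then join back to look up which place produced it
daily_max = df.groupBy('day').agg(max_('number').alias('max_number'))
fav = (df.join(daily_max, on='day')
         .where(col('number') == col('max_number'))
         .select('day', 'max_number', col('place').alias('fav_place')))
result = df.join(fav, on='day')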
Upvotes: 0
Views: 29
Reputation: 9363
Is it possible?
Yes.
from pyspark.sql import Window
from pyspark.sql.functions import desc, first
# ordering by number descending means first() in each day's partition picks
# the row with the max number, so its place is the fav_place - no join needed
eg_window = Window.partitionBy(['day']).orderBy(desc('number'))
df.select('*',
          first('number').over(eg_window).alias('max_number'),
          first('place').over(eg_window).alias('fav_place'))
Or is it better to just use a group by + join?
With the sample data the window function was faster, though that may just be down to how I wrote the join. I'll leave the actual benchmarking on your data to you.
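If you want to compare on your own data, a rough (not rigorous) check is to time an action that forces each plan to run; window_result and join_result below are placeholders for whatever DataFrames the two approaches produce:
import time

def time_it(result_df, label):
    # count() forces the whole plan to execute
    start = time.time()
    result_df.count()
    print(label, round(time.time() - start, 3))

time_it(window_result, 'window')  # DataFrame built with the window approach
time_it(join_result, 'join')      # DataFrame built with groupBy + join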
Upvotes: 1