Reputation: 607
I have a DataFrame df structured as follows:
date_time         id  value
2020-12-06 17:00  A   10
2020-12-06 17:05  A   18
2020-12-06 17:00  B   20
2020-12-06 17:05  B   28
2020-12-06 17:00  C   30
2020-12-06 17:05  C   38
I need to select only the most recent row for each id into a DataFrame named df_last.
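For the sample above, df_last should contain only the 17:05 rows:
date_time         id  value
2020-12-06 17:05  A   18
2020-12-06 17:05  B   28
2020-12-06 17:05  C   38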
This is a solution that works:
from pyspark.sql import functions as F
from pyspark.sql.window import *
df_rows = df.withColumn('row_num', F.row_number().over(Window.partitionBy('id').orderBy(F.desc('date_time')))-1)
df_last = df_rows.filter(F.col('row_num')==0)
I wonder if there is a simpler/cleaner solution.
Upvotes: 1
Views: 1897
Reputation: 42342
That's pretty much the way to do it. Just a couple of minor improvements: import Window explicitly rather than with a wildcard, and there's no need to subtract 1 from the row number - just filter on row_num = 1 instead:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df_rows = df.withColumn(
'row_num',
F.row_number().over(Window.partitionBy('id').orderBy(F.desc('date_time')))
)
df_last = df_rows.filter('row_num = 1')
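As a side note, if you'd rather avoid the window and the helper column altogether, one possible alternative (just a sketch, using only the columns shown in the question) is to aggregate with a struct so that the max is taken on date_time first:
from pyspark.sql import functions as F

# For each id, take the max of a (date_time, value) struct;
# structs compare field by field, so this picks the row with the latest date_time.
df_last = (
    df.groupBy('id')
      .agg(F.max(F.struct('date_time', 'value')).alias('latest'))
      .select('id', 'latest.date_time', 'latest.value')
)
This keeps exactly one row per id without a row_num column, at the cost of listing every column you want to keep inside the struct.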
Upvotes: 2