Reputation: 3101
I want to apply a Window function to a DataFrame
to get only the latest metrics for every Id. For the following data I expect the df
to contain only the first two records after applying a Window function.
| id | metric | transaction_date |
|----|--------|------------------|
| 1  | 0.5    | 05-10-2019       |
| 2  | 15.9   | 07-22-2020       |
| 2  | 4.7    | 11-03-2017       |
Is it a correct approach to use the row_number ranking function? My current implementation looks like this:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = (
    df.withColumn(
        "_row_number",
        F.row_number().over(
            Window.partitionBy("id").orderBy(F.desc("transaction_date"))))
    .filter(F.col("_row_number") == 1)
    .drop("_row_number")
)
Upvotes: 2
Views: 347
Reputation: 786
First sort the DataFrame by id and date (descending), then group by id. Calling first on the groupby object returns the first row of each group, which has the latest date.
import pandas as pd
from datetime import datetime

df = pd.DataFrame({'id': [1, 2, 2],
                   'metric': [0.5, 15.9, 4.7],
                   'date': [datetime(2019, 5, 10),
                            datetime(2020, 7, 22),
                            datetime(2017, 11, 3)]})

## sort df by id and date (latest first within each id)
df = df.sort_values(['id', 'date'], ascending=[True, False])

## return the first row of each group
df.groupby('id').first()
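If you would rather keep id as a regular column instead of having it become the index, the same sort-then-keep-first idea can be expressed with drop_duplicates. A minimal sketch on the same sample data (the variable name latest is mine):

```python
import pandas as pd
from datetime import datetime

df = pd.DataFrame({'id': [1, 2, 2],
                   'metric': [0.5, 15.9, 4.7],
                   'date': [datetime(2019, 5, 10),
                            datetime(2020, 7, 22),
                            datetime(2017, 11, 3)]})

# Sort so the latest date comes first within each id, then keep
# only the first occurrence of each id.
latest = (df.sort_values(['id', 'date'], ascending=[True, False])
            .drop_duplicates('id'))
print(latest)
```

drop_duplicates keeps the first row it sees for each id, so after the descending date sort that is the row with the latest transaction.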
Upvotes: 1
Reputation: 1214
import org.apache.spark.sql.functions.{first, to_date}
import spark.implicits._

val fDF = Seq((1, 0.5, "05-10-2019"),
              (2, 15.9, "07-22-2020"),
              (2, 4.7, "11-03-2017"))
  .toDF("id", "metric", "transaction_date")

// parse the string dates, then sort so the latest date comes first per id
val f1DF = fDF
  .withColumn("transaction_date", to_date('transaction_date, "MM-dd-yyyy"))
  .orderBy('id.asc, 'transaction_date.desc)

// take the first (latest) row of each group
val f2DF = f1DF.groupBy("id")
  .agg(first('transaction_date).alias("transaction_date"),
       first('metric).alias("metric"))

f2DF.show(false)
// +---+----------------+------+
// |id |transaction_date|metric|
// +---+----------------+------+
// |1 |2019-05-10 |0.5 |
// |2 |2020-07-22 |15.9 |
// +---+----------------+------+
Upvotes: 0