Reputation: 37
I have a df with a date column and some other columns. I want to first get the latest date (max(date)) from the date column, and then filter the df to only the rows whose date falls within the last seven days before that max date (including the max date).
For example, if the max date is 2021-01-20, then the filter should keep dates from
2021-01-20 (including max date)
2021-01-19
2021-01-18
2021-01-17
2021-01-16
2021-01-15
2021-01-14
and filter out the rest of the dates. I want to filter the entire df based on this date range. I am using Spark 3.0 (PySpark).
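For reference, a minimal reproducible sketch of the input (the column names date and value are just placeholders):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Tiny sample df: the 2021-01-12 row should be filtered out,
# since only 2021-01-14 .. 2021-01-20 fall in the last 7 days.
df = spark.createDataFrame(
    [('2021-01-12', 1), ('2021-01-14', 2), ('2021-01-20', 3)],
    ['date', 'value'],
).withColumn('date', F.to_date('date'))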
Upvotes: 1
Views: 6856
Reputation: 42342
You can add a column containing the max date, then filter to keep only the rows that fall within 7 days of that max date.
from pyspark.sql import functions as F, Window

# A constant-ordered window spans all rows, so F.max gives the global max date.
df2 = df.withColumn(
    'maxdate',
    F.max('date').over(Window.orderBy(F.lit(1)))
).filter(
    # '>' (not '>=') keeps exactly 7 days: 2021-01-14 .. 2021-01-20 in the example.
    'date > maxdate - interval 7 days'
).drop('maxdate')
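Note that Window.orderBy without a partitionBy moves all rows to a single partition (Spark logs a warning about this), which can hurt performance on large data. A sketch of an equivalent approach that collects the max date to the driver first, assuming the date column is named date:
from pyspark.sql import functions as F

# Compute the single max date on the driver, then filter against it.
max_date = df.agg(F.max('date')).first()[0]
df2 = df.filter(F.col('date') > F.date_sub(F.lit(str(max_date)), 7))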
Upvotes: 3