ritzen101

Reputation: 37

filter pyspark df by max date and date range

I have a df which has a date column and some more columns. I first want to get the latest date (max(date)) from the date column, and then, using that max date, filter the df to only those rows whose date falls within the last seven days before the max date (including the max date).

For example, if the max date is 2021-01-20, then the filter should keep dates from

2021-01-20 (including max date)

2021-01-19

2021-01-18

2021-01-17

2021-01-16

2021-01-15

2021-01-14

and filter out the rest of the dates. I want to filter the entire df based on this date range. I am using Spark 3.0 (PySpark).
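For illustration, a toy DataFrame along these lines (the column names and values here are just an assumption, not my real data) captures the situation:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical sample data; the real df has more columns
df = spark.createDataFrame(
    [('2021-01-20', 'a'), ('2021-01-16', 'b'), ('2021-01-14', 'c'), ('2021-01-10', 'd')],
    ['date', 'value']
).withColumn('date', F.to_date('date'))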

Upvotes: 1

Views: 6856

Answers (1)

mck

Reputation: 42342

You can add a column holding the max date, then filter to the rows whose date falls within the seven days up to and including that max date.

from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    # max over an un-partitioned window puts the global max date on every row
    'maxdate',
    F.max('date').over(Window.orderBy(F.lit(1)))
).filter(
    # keep the max date and the six days before it (seven dates in total, as in the example)
    'date >= maxdate - interval 6 days'
).drop('maxdate')
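As a quick sanity check on the hypothetical toy df sketched in the question (not data from the original post), only the dates within the last seven days up to the max date should remain:

# with the toy data above (max date 2021-01-20), only 2021-01-14,
# 2021-01-16 and 2021-01-20 should survive the filter
df2.orderBy('date').show()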

Upvotes: 3
