Reputation: 38
I'm trying to filter data from the current date back over the last 3 years, and I want to do this on a PySpark DataFrame.
The SQL predicate below needs to be converted into PySpark DataFrame form:
date >= dateadd(month,-4,current_date)
How can I write the above SQL against a PySpark DataFrame?
Upvotes: 0
Views: 238
Reputation: 87164
You need to use a combination of the filter and add_months functions, like this:
from pyspark.sql import functions as F
df = df.filter(F.col("date") >= F.add_months(F.current_date(), -4))
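If you prefer to keep a SQL-style predicate, an equivalent sketch (assuming the date column is a DateType or a yyyy-MM-dd string, so the comparison behaves as expected) uses a standard Spark SQL expression string:
from pyspark.sql import functions as F
# add_months(start, num_months) is standard Spark SQL, so this string predicate is portable
df = df.filter(F.expr("date >= add_months(current_date(), -4)"))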
Upvotes: 0
Reputation: 1390
Let's create a DataFrame to test on:
from datetime import date
from dateutil.relativedelta import relativedelta

# build 12 (id, date) rows: the first day of each month of 2023
start_date = date(2023, 1, 1)
dates_list = []
for i in range(12):
    curr_date = start_date + relativedelta(months=i)
    dates_list.append((i + 1, curr_date))

columns = ["id", "date"]
dates_df = spark.createDataFrame(dates_list, columns)
This creates a DataFrame with 12 rows, one for the first day of each month in 2023:
display(dates_df)
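display is a Databricks notebook helper; in plain PySpark the same 12 rows (2023-01-01 through 2023-12-01) can be printed with:
dates_df.show()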
Now let's filter the DataFrame into a new one. Note that the three-argument dateadd(unit, value, expr) is Databricks SQL syntax (a portable version is shown after the results):
filter_dates_df = dates_df.where("date>=dateadd(month,-4,current_date)")
And the results:
display(filter_dates_df)
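For portability, the same filter can be written with the Column API, since add_months works in any Spark environment while the three-argument dateadd is Databricks-specific. A minimal sketch, reusing the dates_df built above:
from pyspark.sql import functions as F
# keep rows dated within the last 4 months of the current date
filter_dates_df = dates_df.where(F.col("date") >= F.add_months(F.current_date(), -4))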
Upvotes: 0