Ashwini

Reputation: 38

DataBricks Pyspark DATEADD

I'm trying to filter data from the current date back 3 years, and I want to do this on a PySpark dataframe.

The SQL condition below needs to be converted into PySpark dataframe format:

date >= dateadd(month,-4,current_date)

How do I write the above SQL in PySpark dataframe format?

Upvotes: 0

Views: 238

Answers (2)

Alex Ott

Reputation: 87164

You need to use a combination of the filter and add_months functions, like this:

from pyspark.sql import functions as F

df = df.filter(F.col("date") >= F.add_months(F.current_date(), -4))
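For intuition, the date arithmetic that add_months performs (including clamping to the last valid day of the target month) can be sketched in plain Python with dateutil's relativedelta. The fixed date below is hypothetical, standing in for what F.current_date() would return at run time:

```python
from datetime import date
from dateutil.relativedelta import relativedelta

# Hypothetical "current date" for illustration; in Spark this
# would come from F.current_date().
today = date(2023, 10, 31)

# Subtracting 4 months clamps to the last valid day of the target
# month, matching Spark's add_months: 2023-06-31 does not exist,
# so the result is 2023-06-30.
cutoff = today + relativedelta(months=-4)
print(cutoff)  # 2023-06-30

# A row with this date would pass the filter date >= cutoff
row_date = date(2023, 7, 1)
print(row_date >= cutoff)  # True
```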

Upvotes: 0

Chen Hirsh

Reputation: 1390

Let's create a dataframe to test on:

from datetime import date
from dateutil.relativedelta import relativedelta

# Build 12 rows, one for the first day of each month in 2023
start_date = date(2023, 1, 1)
dates_list = []
for i in range(12):
    curr_date = start_date + relativedelta(months=i)
    dates_list.append((i + 1, curr_date))

columns = ["id", "date"]
dates_df = spark.createDataFrame(dates_list, columns)

This creates a dataframe with 12 rows, one for the first day of each month in 2023:

display(dates_df)


Now let's filter the dataframe into a new dataframe:

filter_dates_df = dates_df.where("date >= dateadd(month, -4, current_date)")

And the results:

display(filter_dates_df)

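The where clause above can be simulated in plain Python to see which rows survive. Note the real filter evaluates current_date at run time, so the result depends on when you run it; the fixed date below is a hypothetical stand-in chosen for illustration:

```python
from datetime import date
from dateutil.relativedelta import relativedelta

# First day of each month in 2023, mirroring the dataframe above
dates_list = [(i + 1, date(2023, 1, 1) + relativedelta(months=i))
              for i in range(12)]

# Hypothetical "current date"; Spark would use current_date()
today = date(2023, 12, 15)
cutoff = today + relativedelta(months=-4)  # 2023-08-15

# Keep rows with date >= cutoff, like the where clause
kept = [(i, d) for i, d in dates_list if d >= cutoff]
print(kept)  # Sep, Oct, Nov, Dec 2023 -> 4 rows
```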

Upvotes: 0
