hotshot_02

Reputation: 103

PySpark - generate a dates column containing all the days between two given dates and add it to an existing dataframe

I am relatively new to PySpark. I want to generate a dataframe column with all the dates between two given dates (constants) and add this column to an existing dataframe. What is an efficient way to do this?

I tried this but it didn't work:

df_add_column = (
    df.withColumn("repeat", expr("split(repeat(',', diffDays), ',')"))
    .select("*", posexplode("repeat").alias("DATE", "val"))
    .drop("repeat", "val", "diffDays")
    .withColumn("DATE", expr("date_add('2018-01-01', 'DATE')"))
)

Upvotes: 3

Views: 5039

Answers (1)

blackbishop

Reputation: 32660

You can use the sequence function to generate the dates, then explode the resulting array. Example:

from pyspark.sql import functions as F

df = spark.createDataFrame([(1,)], ["id"])

df1 = df.withColumn(
    "date",
    # sequence builds an inclusive array of dates, which explode
    # then flattens into one row per date
    F.explode(F.expr("sequence(to_date('2021-02-01'), to_date('2021-02-08'), interval 1 day)"))
)

df1.show()

#+---+----------+
#| id|      date|
#+---+----------+
#|  1|2021-02-01|
#|  1|2021-02-02|
#|  1|2021-02-03|
#|  1|2021-02-04|
#|  1|2021-02-05|
#|  1|2021-02-06|
#|  1|2021-02-07|
#|  1|2021-02-08|
#+---+----------+
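
Note that sequence requires Spark 2.4+. If you'd rather avoid the SQL expression string, here is an equivalent sketch using the DataFrame API directly (the literal start and end dates are just placeholders for your own constants):

from pyspark.sql import functions as F

# Same idea with column functions instead of expr():
# for date arguments, sequence defaults to a step of interval 1 day.
df1 = df.withColumn(
    "date",
    F.explode(
        F.sequence(
            F.lit("2021-02-01").cast("date"),  # start date (inclusive)
            F.lit("2021-02-08").cast("date"),  # end date (inclusive)
        )
    )
)

Since explode duplicates every existing row once per generated date, both versions work unchanged on a dataframe with more than one row, which is what the question asks for.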

Upvotes: 5
