Federico Rizzo

Reputation: 302

How to fill gaps between two rows having the difference expressed in days

I have the following DataFrame, where diff_days is the number of days between a row's fx_date and the previous row's fx_date:

+----------+--------+---------+
|   fx_date|  col_1 |diff_days|
+----------+--------+---------+
|2020-01-05|       A|     null|
|2020-01-09|       B|        4|
|2020-01-11|       C|        2|
+----------+--------+---------+

I want a DataFrame with the missing dates added as rows, where each added row repeats the col_1 value of the existing row that precedes it. It should be:

+----------+--------+
|   fx_date|  col_1 |
+----------+--------+
|2020-01-05|       A|
|2020-01-06|       A|
|2020-01-07|       A|
|2020-01-08|       A|
|2020-01-09|       B|
|2020-01-10|       B|
|2020-01-11|       C|
+----------+--------+

Upvotes: 2

Views: 71

Answers (1)

blackbishop

Reputation: 32700

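You can combine the lead and sequence functions: for each row, generate the dates from its own fx_date up to the day before the next row's fx_date, then explode the resulting array. Using lead rather than lag keeps each row's col_1 on the dates that follow it, which is what the expected output asks for:

from pyspark.sql import functions as F, Window

# Assumes fx_date is a DateType column (cast it with F.to_date first if it is a string).
# A window without partitionBy pulls all rows into one partition; fine for small data.
w = Window.orderBy("fx_date")

df1 = df.withColumn(
    "end_dt",
    # Day before the next row's date; null on the last row.
    F.date_sub(F.lead("fx_date", 1).over(w), 1)
).withColumn(
    "fx_date",
    # On the last row end_dt is null, so the sequence holds only fx_date itself.
    F.expr("sequence(fx_date, coalesce(end_dt, fx_date), interval 1 day)")
).withColumn(
    "fx_date",
    F.explode("fx_date")
).drop("end_dt", "diff_days")

df1.show()
#+----------+-----+
#|   fx_date|col_1|
#+----------+-----+
#|2020-01-05|    A|
#|2020-01-06|    A|
#|2020-01-07|    A|
#|2020-01-08|    A|
#|2020-01-09|    B|
#|2020-01-10|    B|
#|2020-01-11|    C|
#+----------+-----+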
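
As a Spark-free cross-check of the result the question asks for (each row's col_1 carried forward until the day before the next row's date), here is a minimal pure-Python sketch. The function name fill_gaps and the sample list are illustrative assumptions, not part of the question or of PySpark; it only mirrors the forward-fill logic on plain tuples.

```python
from datetime import date, timedelta


def fill_gaps(rows):
    """Repeat each (date, value) pair daily until the day before the next row's date."""
    rows = sorted(rows)  # order by date, like Window.orderBy("fx_date")
    filled = []
    for i, (day, value) in enumerate(rows):
        # The last row has no successor, so it covers only its own date.
        end = rows[i + 1][0] if i + 1 < len(rows) else day + timedelta(days=1)
        while day < end:
            filled.append((day, value))
            day += timedelta(days=1)
    return filled


# Hypothetical sample mirroring the question's three rows.
sample = [
    (date(2020, 1, 5), "A"),
    (date(2020, 1, 9), "B"),
    (date(2020, 1, 11), "C"),
]

for day, value in fill_gaps(sample):
    print(day, value)
```

Comparing this list with the output of df1.show() is a quick way to confirm that the Spark version fills forward rather than backward.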

Upvotes: 2
