Deepak Kumar
Deepak Kumar

Reputation: 17

Need to add date ranges between two date columns in pyspark?

I have input pyspark dataframe with columns like ID,StartDatetime,EndDatetime. I want to add new column named newdate based on startdatetime and enddatetime.

Input DF :-

ID  StartDatetime       EndDatetime
1   21-06-2021 07:00    24-06-2021 16:00
2   21-06-2021 07:00    22-06-2021 16:00

required output :-

ID  StartDatetime       EndDatetime         newdate
1   21-06-2021 07:00    24-06-2021 16:00    21-06-2021
1   21-06-2021 07:00    24-06-2021 16:00    22-06-2021
1   21-06-2021 07:00    24-06-2021 16:00    23-06-2021
1   21-06-2021 07:00    24-06-2021 16:00    24-06-2021
2   21-06-2021 07:00    22-06-2021 16:00    21-06-2021
2   21-06-2021 07:00    22-06-2021 16:00    22-06-2021

Upvotes: 0

Views: 972

Answers (1)

Cena
Cena

Reputation: 3419

You can use explode and array_repeat to duplicate the rows.

I use a combination of row_number and date functions to get the date ranges between start and end dates:

from pyspark.sql.types import *
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window().partitionBy("ID").orderBy('StartDatetime')

output_df = df.withColumn("diff", 1+F.datediff(F.to_date(F.unix_timestamp('EndDatetime', 'dd-MM-yyyy HH:mm').cast('timestamp')), \
                                               F.to_date(F.unix_timestamp('StartDatetime', 'dd-MM-yyyy HH:mm').cast('timestamp'))))\
              .withColumn('diff', F.expr('explode(array_repeat(diff,int(diff)))'))\
              .withColumn("diff", F.row_number().over(w))\
              .withColumn("start_dt", F.to_date(F.unix_timestamp('StartDatetime', 'dd-MM-yyyy HH:mm').cast('timestamp')))\
              .withColumn("newdate", F.date_format(F.expr("date_add(start_dt, diff-1)"), 'dd-MM-yyyy')).drop('diff', 'start_dt')

Output:

output_df.orderBy("ID", "newdate").show()
+---+----------------+----------------+----------+
| ID|   StartDatetime|     EndDatetime|   newdate|
+---+----------------+----------------+----------+
|  1|21-06-2021 07:00|24-06-2021 16:00|21-06-2021|
|  1|21-06-2021 07:00|24-06-2021 16:00|22-06-2021|
|  1|21-06-2021 07:00|24-06-2021 16:00|23-06-2021|
|  1|21-06-2021 07:00|24-06-2021 16:00|24-06-2021|
|  2|21-06-2021 07:00|22-06-2021 16:00|21-06-2021|
|  2|21-06-2021 07:00|22-06-2021 16:00|22-06-2021|
+---+----------------+----------------+----------+

I dropped the diff column, but displaying it will help you understand the logic if it's not clear.

Upvotes: 2

Related Questions