Reputation: 17
I have input pyspark dataframe with columns like ID,StartDatetime,EndDatetime. I want to add new column named newdate based on startdatetime and enddatetime.
Input DF :-
ID StartDatetime EndDatetime
1 21-06-2021 07:00 24-06-2021 16:00
2 21-06-2021 07:00 22-06-2021 16:00
required output :-
ID StartDatetime EndDatetime newdate
1 21-06-2021 07:00 24-06-2021 16:00 21-06-2021
1 21-06-2021 07:00 24-06-2021 16:00 22-06-2021
1 21-06-2021 07:00 24-06-2021 16:00 23-06-2021
1 21-06-2021 07:00 24-06-2021 16:00 24-06-2021
2 21-06-2021 07:00 22-06-2021 16:00 21-06-2021
2 21-06-2021 07:00 22-06-2021 16:00 22-06-2021
Upvotes: 0
Views: 972
Reputation: 3419
You can use explode
and array_repeat
to duplicate the rows.
I use a combination of row_number
and date
functions to get the date ranges between start
and end
dates:
from pyspark.sql.types import *
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w = Window().partitionBy("ID").orderBy('StartDatetime')
output_df = df.withColumn("diff", 1+F.datediff(F.to_date(F.unix_timestamp('EndDatetime', 'dd-MM-yyyy HH:mm').cast('timestamp')), \
F.to_date(F.unix_timestamp('StartDatetime', 'dd-MM-yyyy HH:mm').cast('timestamp'))))\
.withColumn('diff', F.expr('explode(array_repeat(diff,int(diff)))'))\
.withColumn("diff", F.row_number().over(w))\
.withColumn("start_dt", F.to_date(F.unix_timestamp('StartDatetime', 'dd-MM-yyyy HH:mm').cast('timestamp')))\
.withColumn("newdate", F.date_format(F.expr("date_add(start_dt, diff-1)"), 'dd-MM-yyyy')).drop('diff', 'start_dt')
Output:
output_df.orderBy("ID", "newdate").show()
+---+----------------+----------------+----------+
| ID| StartDatetime| EndDatetime| newdate|
+---+----------------+----------------+----------+
| 1|21-06-2021 07:00|24-06-2021 16:00|21-06-2021|
| 1|21-06-2021 07:00|24-06-2021 16:00|22-06-2021|
| 1|21-06-2021 07:00|24-06-2021 16:00|23-06-2021|
| 1|21-06-2021 07:00|24-06-2021 16:00|24-06-2021|
| 2|21-06-2021 07:00|22-06-2021 16:00|21-06-2021|
| 2|21-06-2021 07:00|22-06-2021 16:00|22-06-2021|
+---+----------------+----------------+----------+
I dropped the diff
column, but displaying it will help you understand the logic if it's not clear.
Upvotes: 2