Reputation: 51
I have a PySpark dataframe df:
---------------------------------------------------------
primaryKey | start_timestamp     | end_timestamp
---------------------------------------------------------
key1       | 2020-08-13 15:40:00 | 2020-08-13 15:44:47
key2       | 2020-08-14 12:00:00 | 2020-08-14 12:01:13
I want to create a dataframe that contains a time series between start_timestamp and end_timestamp for every key, split into intervals of x seconds. For example, for a gap of x = 120 seconds, the output would be:
-----------------------------------------------------------
primaryKey | start_timestamp_new | end_timestamp_new
-----------------------------------------------------------
key1       | 2020-08-13 15:40:00 | 2020-08-13 15:41:59
key1       | 2020-08-13 15:42:00 | 2020-08-13 15:43:59
key1       | 2020-08-13 15:44:00 | 2020-08-13 15:45:59
key2       | 2020-08-14 12:00:00 | 2020-08-14 12:01:59
I am trying to use the approach mentioned here, but I am unable to apply it to a Spark dataframe.
Any pointers on how to create this would be a huge help.
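For reference, a minimal sketch of how the example dataframe above could be built (assuming the timestamps start out as strings and spark is an existing SparkSession):

df = spark.createDataFrame(
    [('key1', '2020-08-13 15:40:00', '2020-08-13 15:44:47'),
     ('key2', '2020-08-14 12:00:00', '2020-08-14 12:01:13')],
    ['primaryKey', 'start_timestamp', 'end_timestamp'])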
Upvotes: 0
Views: 195
Reputation: 13581
You can use the sequence function.
from pyspark.sql.functions import col, explode, expr, sequence, to_timestamp

x = 120  # gap in seconds

# Generate one row per x-second step between start and end,
# then derive each slice's end as its start + (x - 1) seconds.
df.withColumn('start_timestamp', to_timestamp('start_timestamp')) \
  .withColumn('end_timestamp', to_timestamp('end_timestamp')) \
  .withColumn('start_timestamp', explode(sequence('start_timestamp', 'end_timestamp', expr(f'interval {x} seconds')))) \
  .withColumn('end_timestamp', col('start_timestamp') + expr(f'interval {x - 1} seconds')) \
  .show()
+----------+-------------------+-------------------+
|primaryKey| start_timestamp| end_timestamp|
+----------+-------------------+-------------------+
| key1|2020-08-13 15:40:00|2020-08-13 15:41:59|
| key1|2020-08-13 15:42:00|2020-08-13 15:43:59|
| key1|2020-08-13 15:44:00|2020-08-13 15:45:59|
| key2|2020-08-14 12:00:00|2020-08-14 12:01:59|
+----------+-------------------+-------------------+
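Note that, as in the expected output in the question, the last slice for a key can extend past the original end_timestamp (key1's last slice ends at 15:45:59 although its end_timestamp is 15:44:47). If you want the _new column names from the question, the result can simply be renamed afterwards; a minimal sketch, where result is a hypothetical name for the dataframe returned by the chain above:

# 'result' is assumed to hold the dataframe produced by the chain above
result = result.withColumnRenamed('start_timestamp', 'start_timestamp_new') \
               .withColumnRenamed('end_timestamp', 'end_timestamp_new')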
Upvotes: 2