Marco

Reputation: 1235

PySpark: select only the first timestamp that is after another column's timestamp within the same group

I have the following dataset:

 ID Timestamp1             Timestamp2
 1  2022-01-01T12:00:00    2022-01-01T12:30:00
 1  2022-01-01T13:00:00    2022-01-01T12:30:00
 1  2022-01-01T14:00:00    2022-01-01T12:30:00
 2  2022-02-02T09:00:00    2022-02-02T10:15:00
 2  2022-02-02T10:00:00    2022-02-02T10:15:00
 2  2022-02-03T11:00:00    2022-02-02T10:15:00

Each unique ID will always have the exact same value in Timestamp2 and different values in Timestamp1. The goal is to get, for each ID, the first timestamp in the column Timestamp1 that is after the value of Timestamp2. The result should look like this:

 ID Timestamp1             Timestamp2
 1  2022-01-01T13:00:00    2022-01-01T12:30:00
 2  2022-02-03T11:00:00    2022-02-02T10:15:00

I am not sure whether the best solution in this case is a window function or whether this can be done directly with a groupBy.

EDIT: I actually found a solution, but I don't think it is the best one, so I am still looking for a better approach. My idea is to create a new column with the difference between the Timestamp2 and Timestamp1 columns, remove all the negative values, and finally group by ID, taking the minimum of Timestamp1 for each group.
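
For reference, a minimal, self-contained sketch of that workaround (assuming Spark 3.x, so the ISO timestamp strings cast directly; the diff column name is just for illustration):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Recreate the sample data from the question
df = spark.createDataFrame(
    [
        (1, '2022-01-01T12:00:00', '2022-01-01T12:30:00'),
        (1, '2022-01-01T13:00:00', '2022-01-01T12:30:00'),
        (1, '2022-01-01T14:00:00', '2022-01-01T12:30:00'),
        (2, '2022-02-02T09:00:00', '2022-02-02T10:15:00'),
        (2, '2022-02-02T10:00:00', '2022-02-02T10:15:00'),
        (2, '2022-02-03T11:00:00', '2022-02-02T10:15:00'),
    ],
    ['ID', 'Timestamp1', 'Timestamp2'],
).withColumn('Timestamp1', F.col('Timestamp1').cast('timestamp')) \
 .withColumn('Timestamp2', F.col('Timestamp2').cast('timestamp'))

result = (
    df
    # Difference in seconds between Timestamp1 and Timestamp2
    .withColumn('diff', F.col('Timestamp1').cast('long') - F.col('Timestamp2').cast('long'))
    # Drop non-positive differences, i.e. Timestamp1 values that are not after Timestamp2
    .where(F.col('diff') > 0)
    # The smallest remaining Timestamp1 per ID is the first one after Timestamp2
    .groupBy('ID', 'Timestamp2')
    .agg(F.min('Timestamp1').alias('Timestamp1'))
    .select('ID', 'Timestamp1', 'Timestamp2')
)
result.show(truncate=False)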

Upvotes: 0

Views: 161

Answers (2)

PieCot

Reputation: 3629

Something like this?

from pyspark.sql import functions as F, Window

(
    df
    # Keep only rows where Timestamp1 is greater than Timestamp2
    .where(F.col('Timestamp1') > F.col('Timestamp2'))
    # add a column with the rank based on Timestamp1 for each ID
    .withColumn('rank', F.rank().over(Window.partitionBy('ID').orderBy('Timestamp1')))
    # Keep only rows with rank equal to one, i.e. the ones where Timestamp1 is the closest to Timestamp2
    .where(F.col('rank') == 1)
    .drop('rank')
)

The result is:

+---+-------------------+-------------------+
|ID |Timestamp1         |Timestamp2         |
+---+-------------------+-------------------+
|1  |2022-01-01 13:00:00|2022-01-01 12:30:00|
|2  |2022-02-03 11:00:00|2022-02-02 10:15:00|
+---+-------------------+-------------------+
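
One caveat, based on general rank semantics rather than anything stated in the question: if two rows of the same ID ever shared the exact same Timestamp1, F.rank would assign them the same rank and both would pass the rank == 1 filter. If exactly one row per ID is required, F.row_number breaks ties; a sketch reusing the df and imports from above:

(
    df
    .where(F.col('Timestamp1') > F.col('Timestamp2'))
    # row_number always yields a single row per partition, even on equal Timestamp1 values
    .withColumn('rn', F.row_number().over(Window.partitionBy('ID').orderBy('Timestamp1')))
    .where(F.col('rn') == 1)
    .drop('rn')
)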

Upvotes: 0

walking

Reputation: 960

What about the following simple group by? It's only one shuffle operation (the group by), so I don't think a window function can really help with performance here.

from pyspark.sql.functions import col, min

# Keep only the rows where Timestamp1 is after Timestamp2, then take the
# earliest remaining Timestamp1 per ID (Timestamp2 is constant within an ID)
df \
.filter(col('Timestamp1') > col('Timestamp2')) \
.groupBy('ID', 'Timestamp2') \
.agg(min('Timestamp1').alias('Timestamp1'))

Upvotes: 2
