Marco

Reputation: 1235

PySpark: select only the first timestamp that is after another column's timestamp within the same group

I have the following dataset:

 ID Timestamp1             Timestamp2
 1  2022-01-01T12:00:00    2022-01-01T12:30:00
 1  2022-01-01T13:00:00    2022-01-01T12:30:00
 1  2022-01-01T14:00:00    2022-01-01T12:30:00
 2  2022-02-02T09:00:00    2022-02-02T10:15:00
 2  2022-02-02T10:00:00    2022-02-02T10:15:00
 2  2022-02-03T11:00:00    2022-02-02T10:15:00

Each unique ID will always have the exact same value in Timestamp2 and different values in Timestamp1. The goal is to get, for each ID, the first timestamp in the column Timestamp1 that is after the value of Timestamp2. The result should look like this:

 ID Timestamp1             Timestamp2
 1  2022-01-01T13:00:00    2022-01-01T12:30:00
 2  2022-02-03T11:00:00    2022-02-02T10:15:00

I am not sure whether the best solution in this case is a window function or whether this can be done directly with a groupBy.

EDIT: I actually found a solution, but I don't think it is the best one, so I am still looking for a better approach. My idea is to create a new column with the difference between the Timestamp2 and Timestamp1 columns, remove all the negative values, and finally group by ID, taking the minimum of Timestamp1 for each group.
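
For reference, a minimal, self-contained sketch of that workaround (assuming Spark 3.x, so the ISO timestamp strings cast directly; the diff column name is just for illustration):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Recreate the sample data from the question
df = spark.createDataFrame(
    [
        (1, '2022-01-01T12:00:00', '2022-01-01T12:30:00'),
        (1, '2022-01-01T13:00:00', '2022-01-01T12:30:00'),
        (1, '2022-01-01T14:00:00', '2022-01-01T12:30:00'),
        (2, '2022-02-02T09:00:00', '2022-02-02T10:15:00'),
        (2, '2022-02-02T10:00:00', '2022-02-02T10:15:00'),
        (2, '2022-02-03T11:00:00', '2022-02-02T10:15:00'),
    ],
    ['ID', 'Timestamp1', 'Timestamp2'],
).withColumn('Timestamp1', F.col('Timestamp1').cast('timestamp')) \
 .withColumn('Timestamp2', F.col('Timestamp2').cast('timestamp'))

result = (
    df
    # Difference in seconds between Timestamp1 and Timestamp2
    .withColumn('diff', F.col('Timestamp1').cast('long') - F.col('Timestamp2').cast('long'))
    # Drop non-positive differences, i.e. Timestamp1 values that are not after Timestamp2
    .where(F.col('diff') > 0)
    # The smallest remaining Timestamp1 per ID is the first one after Timestamp2
    .groupBy('ID', 'Timestamp2')
    .agg(F.min('Timestamp1').alias('Timestamp1'))
    .select('ID', 'Timestamp1', 'Timestamp2')
)
result.show(truncate=False)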

Upvotes: 0

Views: 161

Answers (2)

PieCot

Reputation: 3629

Something like this?

from pyspark.sql import functions as F, Window

(
    df
    # Keep only rows where Timestamp1 is greater than Timestamp2
    .where(F.col('Timestamp1') > F.col('Timestamp2'))
    # add a column with the rank based on Timestamp1 for each ID
    .withColumn('rank', F.rank().over(Window.partitionBy('ID').orderBy('Timestamp1')))
    # Keep only rows with rank equal to one, i.e. the ones where Timestamp1 is the closest to Timestamp2
    .where(F.col('rank') == 1)
    .drop('rank')
)

The result is:

+---+-------------------+-------------------+
|ID |Timestamp1         |Timestamp2         |
+---+-------------------+-------------------+
|1  |2022-01-01 13:00:00|2022-01-01 12:30:00|
|2  |2022-02-03 11:00:00|2022-02-02 10:15:00|
+---+-------------------+-------------------+
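
One caveat, based on general rank semantics rather than anything stated in the question: if two rows of the same ID ever shared the exact same Timestamp1, F.rank would assign them the same rank and both would pass the rank == 1 filter. If exactly one row per ID is required, F.row_number breaks ties; a sketch reusing the df and imports from above:

(
    df
    .where(F.col('Timestamp1') > F.col('Timestamp2'))
    # row_number always yields a single row per partition, even on equal Timestamp1 values
    .withColumn('rn', F.row_number().over(Window.partitionBy('ID').orderBy('Timestamp1')))
    .where(F.col('rn') == 1)
    .drop('rn')
)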

Upvotes: 0

walking

Reputation: 960

What about the following simple group by? It's only one shuffle operation (the group by), so I don't think a window function can really help with performance here.

from pyspark.sql.functions import col, min

# Keep only the rows where Timestamp1 is after Timestamp2, then take the
# earliest remaining Timestamp1 per ID (Timestamp2 is constant within an ID)
df \
.filter(col('Timestamp1') > col('Timestamp2')) \
.groupBy('ID', 'Timestamp2') \
.agg(min('Timestamp1').alias('Timestamp1'))

Upvotes: 2
