Reputation: 253
I'm working with time series data in Python PySpark. I have a situation like this:
name  value  date
Anna  5      1/2020
Anna  6      2/2020
Eve   1      1/2020
Eve   1      2/2020
Anna  2      6/2020
In an ideal case, we'd know the value of Anna and Eve for every month. Anna has reports for months 1 and 2, but not 3, 4, or 5. I want to find the maximum month of each "subgroup" of consecutive reports for every name. The desired output is thus:
name  date
Anna  2/2020
Anna  6/2020
Eve   2/2020
I've tried tinkering with window functions and using first() after a max(), but I haven't been successful.
Any hints or solutions are appreciated. Thanks.
Upvotes: 0
Views: 64
Reputation: 42342
Filter to the rows where the next row for the same name is more than one month after the current row, or where the next row does not exist (i.e. this row is the last row for that name):
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# assumes 'date' is a DateType column; if it's a string like '1/2020',
# parse it first, e.g. df.withColumn('date', F.to_date('date', 'M/yyyy'))
lead_date = F.lead(F.col('date')).over(Window.partitionBy('name').orderBy('date'))

# a row ends a run when the next report is more than one month away,
# or when there is no next report for that name
df = df.withColumn(
    'end',
    (lead_date > F.add_months(F.col('date'), 1)) | lead_date.isNull()
)
df = df.filter('end = true').drop('end')
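The same "keep a row when the next report for that name is more than one month away, or absent" logic can be sketched without Spark, in plain Python, to make the grouping idea concrete. The function name and tuple layout here are illustrative, not from the question:

```python
from collections import defaultdict

def last_of_runs(rows):
    """rows: list of (name, value, (month, year)) tuples.
    Returns (name, 'month/year') for the last month of each
    consecutive run of reports per name -- the same filter the
    Spark snippet applies with lead() over a window."""
    by_name = defaultdict(list)
    for name, _value, (month, year) in rows:
        by_name[name].append((year, month))
    out = []
    for name, dates in by_name.items():
        dates.sort()
        for i, (y, m) in enumerate(dates):
            nxt = dates[i + 1] if i + 1 < len(dates) else None
            # keep the row if the next report is more than one
            # month away, or if there is no next report
            if nxt is None or (nxt[0] * 12 + nxt[1]) - (y * 12 + m) > 1:
                out.append((name, f"{m}/{y}"))
    return out

rows = [
    ("Anna", 5, (1, 2020)),
    ("Anna", 6, (2, 2020)),
    ("Eve", 1, (1, 2020)),
    ("Eve", 1, (2, 2020)),
    ("Anna", 2, (6, 2020)),
]
print(sorted(last_of_runs(rows)))
# → [('Anna', '2/2020'), ('Anna', '6/2020'), ('Eve', '2/2020')]
```

This matches the desired output in the question: each name keeps only the final month of every consecutive stretch of reports.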
Upvotes: 2