Reputation: 253
I'm working with time series data in Python PySpark. I have a situation like this:
name  value  date
Anna  5      1/2020
Anna  6      2/2020
Eve   1      1/2020
Eve   1      2/2020
Anna  2      6/2020
In an ideal case, we'd know the value of Anna and Eve for every month. Anna has reports for months 1 and 2, but not 3, 4, or 5. I want to find the maximum month of each "subgroup" of consecutive reports for every name. The desired output is thus:
name  date
Anna  2/2020
Anna  6/2020
Eve   2/2020
I've tried tinkering with window functions and using first() after a max(), but I haven't been successful.
Any hints or solutions are appreciated. Thanks.
Upvotes: 0
Views: 64
Reputation: 42342
Filter to the rows where the next row for the same name is more than one month after the current row, or where the next row does not exist (i.e. this row is the last row for that name):
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# assumes 'date' is a DateType column; if it's a string like '1/2020',
# parse it first, e.g. df.withColumn('date', F.to_date('date', 'M/yyyy'))
lead_date = F.lead(F.col('date')).over(Window.partitionBy('name').orderBy('date'))

# a row ends a run when the next report is more than one month away,
# or when there is no next report for that name
df = df.withColumn(
    'end',
    (lead_date > F.add_months(F.col('date'), 1)) | lead_date.isNull()
)
df = df.filter('end = true').drop('end')
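The same "keep a row when the next report for that name is more than one month away, or absent" logic can be sketched without Spark, in plain Python, to make the grouping idea concrete. The function name and tuple layout here are illustrative, not from the question:

```python
from collections import defaultdict

def last_of_runs(rows):
    """rows: list of (name, value, (month, year)) tuples.
    Returns (name, 'month/year') for the last month of each
    consecutive run of reports per name -- the same filter the
    Spark snippet applies with lead() over a window."""
    by_name = defaultdict(list)
    for name, _value, (month, year) in rows:
        by_name[name].append((year, month))
    out = []
    for name, dates in by_name.items():
        dates.sort()
        for i, (y, m) in enumerate(dates):
            nxt = dates[i + 1] if i + 1 < len(dates) else None
            # keep the row if the next report is more than one
            # month away, or if there is no next report
            if nxt is None or (nxt[0] * 12 + nxt[1]) - (y * 12 + m) > 1:
                out.append((name, f"{m}/{y}"))
    return out

rows = [
    ("Anna", 5, (1, 2020)),
    ("Anna", 6, (2, 2020)),
    ("Eve", 1, (1, 2020)),
    ("Eve", 1, (2, 2020)),
    ("Anna", 2, (6, 2020)),
]
print(sorted(last_of_runs(rows)))
# → [('Anna', '2/2020'), ('Anna', '6/2020'), ('Eve', '2/2020')]
```

This matches the desired output in the question: each name keeps only the final month of every consecutive stretch of reports.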
Upvotes: 2