Reputation: 1148
This is my PySpark Dataframe:
timestamp                  category  value
2000-10-11 11:00:00+00:00  A         1
2000-10-11 12:00:00+00:00  A         2
2000-10-12 13:00:00+00:00  A         1
2000-10-12 14:00:00+00:00  A         3
2000-10-11 14:00:00+00:00  B         1
2000-10-11 15:00:00+00:00  B         1
I want to get this result (a cumulative sum of value per category that resets on every new day):
timestamp                  category  value  cum_sum_by_date
2000-10-11 11:00:00+00:00  A         1      1
2000-10-11 12:00:00+00:00  A         2      3
2000-10-12 13:00:00+00:00  A         1      1
2000-10-12 14:00:00+00:00  A         3      4
2000-10-11 14:00:00+00:00  B         1      1
2000-10-11 15:00:00+00:00  B         1      2
I know how to get the cumulative sum grouped just by category, but I cannot reset the counter on every new day:
from pyspark.sql import Window
from pyspark.sql import functions as f

# cumulative sum per category, ordered by timestamp (never resets within a category)
w = (Window.partitionBy('category').orderBy('timestamp')
     .rangeBetween(Window.unboundedPreceding, 0))
df = df.withColumn('cum_sum_by_category', f.sum('value').over(w))
df.show()
Upvotes: 0
Views: 130
Reputation: 2939
Your window w should be partitioned by the date of the timestamp alongside the category column. Use the to_date function to extract the date from the timestamp and partition the window w by it.
from pyspark.sql import Window
from pyspark.sql import functions as f

# partition by both the category and the calendar date, so the running sum restarts each day
w = (Window
     .partitionBy('category', f.to_date(f.col('timestamp')))
     .orderBy('timestamp')
     .rangeBetween(Window.unboundedPreceding, 0))
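A minimal usage sketch, assuming df is the DataFrame from the question; the column name cum_sum_by_date just mirrors the label used in the expected output:

# apply the per-category, per-day window to the running sum
df = df.withColumn('cum_sum_by_date', f.sum('value').over(w))
df.show()
# the 2000-10-12 rows for category A now restart at 1 and 4, as in the expected output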
Upvotes: 1