Calculate rolling sum of an array in PySpark using Window()?

Question

I want to calculate a rolling sum of an ArrayType column given a unix timestamp and group it by 2 second increments. Example input/output is below. I think the Window() function will work, but I'm pretty new to PySpark and am totally lost. Any input is greatly appreciated!

Input:

timestamp     vars 
2             [1,2,1,2]
2             [1,2,1,2]
3             [1,1,1,2]
4             [1,3,4,2]
5             [1,1,1,3]
6             [1,2,3,5]
9             [1,2,3,5]

Expected output:

+---------+-----------------------+
|timestamp|vars                   |
+---------+-----------------------+
|2        |[2.0, 4.0, 2.0, 4.0]   |
|4        |[4.0, 8.0, 7.0, 8.0]   |
|6        |[6.0, 11.0, 11.0, 16.0]|
|10       |[7.0, 13.0, 14.0, 21.0]|
+---------+-----------------------+

Thanks!

Edit: Multiple rows can have the same timestamp, and the timestamps might not be consecutive. The length of vars may also be > 3. Looking for a slightly generic solution please.

Vamsi Prabhala · Accepted Answer

Use the sum window function to compute the running sum of each array element, and row_number to pick every second row.

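from pyspark.sql import Window
from pyspark.sql.functions import array, col, row_number, sum  # note: this sum shadows Python's built-in sum
w = Window.orderBy(col('timestamp'))
# running sum of each array element; change 4 to the length of vars
result = df.withColumn('summed_vars', array(*[sum(col('vars')[i]).over(w) for i in range(4)])) \
           .withColumn('rnum', row_number().over(w))  # row number used by the filter below
result.filter(col('rnum') % 2 == 0).select('timestamp', 'summed_vars').show()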

Change the %2 as needed per your time interval.
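To make the per-element window sum concrete, here is a minimal plain-Python sketch (no Spark required) of an element-wise running sum over rows sorted by timestamp. The name running_sums is hypothetical, and the sketch assumes every array has the same length. An ordered window without an explicit frame defaults to a range frame from the start up to the current row, so rows that share a timestamp all receive the same total; the sketch imitates that by adding tied rows together before emitting them.

```python
from itertools import groupby

def running_sums(rows):
    """rows: list of (timestamp, values). Returns one (timestamp, total) per input row.

    Rows with equal timestamps are summed together first, so tied rows
    all carry the same running total.
    """
    rows = sorted(rows, key=lambda r: r[0])
    if not rows:
        return []
    total = [0] * len(rows[0][1])  # assumes all arrays have this length
    out = []
    for ts, group in groupby(rows, key=lambda r: r[0]):
        group = list(group)
        for _, values in group:
            total = [t + v for t, v in zip(total, values)]
        # every tied row gets a copy of the same total
        out.extend((ts, list(total)) for _ in group)
    return out

rows = [(2, [1, 2, 1, 2]), (2, [1, 2, 1, 2]), (3, [1, 1, 1, 2]), (4, [1, 3, 4, 2])]
result = running_sums(rows)
```

Both rows with timestamp 2 end up with the same total here, which is why picking rows by row number alone can be fragile when timestamps repeat.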

Edit: To group by time intervals, use the window function. This assumes the timestamp column is of data type timestamp.

from pyspark.sql import Window
from pyspark.sql.functions import window, sum, row_number, array, col
w = Window.orderBy(col('timestamp'))
# assign each row to a 2-second interval and compute the running sum of each array element
result = df.withColumn('timestamp_interval', window(col('timestamp'), '2 second')) \
           .withColumn('summed_vars', array(*[sum(col('vars')[i]).over(w) for i in range(4)]))
# within each interval, rank rows from latest to earliest timestamp
w1 = Window.partitionBy(col('timestamp_interval')).orderBy(col('timestamp').desc())
final_result = result.withColumn('rnum', row_number().over(w1))
# keep only the last row of each interval, which carries the total up to that point
final_result.filter(col('rnum') == 1).drop('rnum', 'vars').show()
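As a cross-check on the interval logic, here is a small plain-Python sketch that reproduces the expected output from the question. It rests on an assumption inferred from the sample data: each bucket is labelled by its upper bound, so timestamps 3 and 4 both land in bucket 4. The name bucketed_running_sums and the width parameter are hypothetical. Note that Spark's window function produces start-inclusive intervals of the form [start, end), so its bucket boundaries may sit one second off from the sample and are worth checking against your data.

```python
import math

def bucketed_running_sums(rows, width=2):
    """Sum arrays per bucket, then accumulate across buckets in order.

    rows: list of (timestamp, values); all arrays assumed equal length.
    Buckets are labelled by their upper bound (assumption from the sample).
    """
    per_bucket = {}
    for ts, values in rows:
        label = math.ceil(ts / width) * width
        acc = per_bucket.get(label, [0] * len(values))
        per_bucket[label] = [a + v for a, v in zip(acc, values)]

    total = None
    out = []
    for label in sorted(per_bucket):  # empty buckets simply do not appear
        sums = per_bucket[label]
        total = sums if total is None else [t + s for t, s in zip(total, sums)]
        out.append((label, [float(t) for t in total]))
    return out

rows = [
    (2, [1, 2, 1, 2]), (2, [1, 2, 1, 2]), (3, [1, 1, 1, 2]), (4, [1, 3, 4, 2]),
    (5, [1, 1, 1, 3]), (6, [1, 2, 3, 5]), (9, [1, 2, 3, 5]),
]
result = bucketed_running_sums(rows)
```

With the question's input this yields the four rows labelled 2, 4, 6 and 10 from the expected output; bucket 8 is absent because no row falls into it.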
