Reputation: 39
I want to calculate a rolling sum of an ArrayType column, given a unix timestamp, grouped into 2 second increments. Example input/output is below. I think the Window() function will work, but I'm pretty new to PySpark and am totally lost. Any input is greatly appreciated!
Input:
timestamp vars
2 [1,2,1,2]
2 [1,2,1,2]
3 [1,1,1,2]
4 [1,3,4,2]
5 [1,1,1,3]
6 [1,2,3,5]
9 [1,2,3,5]
Expected output:
+---------+-----------------------+
|timestamp|vars |
+---------+-----------------------+
|2 |[2.0, 4.0, 2.0, 4.0] |
|4 |[4.0, 8.0, 7.0, 8.0] |
|6 |[6.0, 11.0, 11.0, 16.0]|
|10 |[7.0, 13.0, 14.0, 21.0]|
+---------+-----------------------+
Thanks!
Edit: Multiple rows can have the same timestamp, and timestamps might not be consecutive. The length of vars may also be greater than 3 (the example uses length 4). I'm looking for a reasonably generic solution, please.
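For reference, a minimal sketch to reproduce the input (assuming an existing SparkSession named spark and plain integer timestamps; the answers below refer to this DataFrame as df):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# example input from the question: two rows share timestamp 2, and 7/8 are missing
df = spark.createDataFrame(
    [(2, [1, 2, 1, 2]), (2, [1, 2, 1, 2]), (3, [1, 1, 1, 2]), (4, [1, 3, 4, 2]),
     (5, [1, 1, 1, 3]), (6, [1, 2, 3, 5]), (9, [1, 2, 3, 5])],
    ['timestamp', 'vars'])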
Upvotes: 2
Views: 1617
Reputation: 49270
Use the sum window function to compute the running sum, and row_number to pick every second timestamp row.
from pyspark.sql import Window
from pyspark.sql.functions import sum, row_number, array, col

w = Window.orderBy(col('timestamp'))
result = df.withColumn('summed_vars', array(*[sum(col('vars')[i]).over(w) for i in range(4)])) \
           .withColumn('rnum', row_number().over(w))  # change range(4) to match the length of vars
result.filter(col('rnum') % 2 == 0).select('timestamp', 'summed_vars').show()
Change the % 2 as needed for your time interval. Note that this relies on the timestamps being consecutive integers with one row each; the edit below handles duplicate and non-consecutive timestamps.
Edit: Grouping by time intervals with the window function, assuming the timestamp column is of data type timestamp.
from pyspark.sql import Window
from pyspark.sql.functions import window, sum, row_number, array, col

w = Window.orderBy(col('timestamp'))

# bucket each row into a 2-second tumbling window and compute the running sum per array position
result = df.withColumn('timestamp_interval', window(col('timestamp'), '2 second')) \
    .withColumn('summed_vars', array(*[sum(col('vars')[i]).over(w) for i in range(4)]))

# keep only the latest row within each 2-second interval
w1 = Window.partitionBy(col('timestamp_interval')).orderBy(col('timestamp').desc())
final_result = result.withColumn('rnum', row_number().over(w1))
final_result.filter(col('rnum') == 1).drop(*['rnum', 'vars']).show()
Upvotes: 1
Reputation: 32720
For Spark 2.4+ you can use array functions and higher-order functions. This solution works for different array sizes (even if the size differs between rows). Here are the steps explained:
First, group by 2-second buckets and collect the vars arrays into one array column. The expression ceil(timestamp / 2) * 2 maps each timestamp to the upper edge of its 2-second bucket (e.g. 3 → 4, 9 → 10):
df = df.groupBy((ceil(col("timestamp") / 2) * 2).alias("timestamp")) \
.agg(collect_list(col("vars")).alias("vars"))
df.show()
#+---------+----------------------+
#|timestamp|vars |
#+---------+----------------------+
#|6 |[[1, 1, 1], [1, 2, 3]]|
#|2 |[[1, 1, 1], [1, 2, 1]]|
#|4 |[[1, 1, 1], [1, 3, 4]]|
#+---------+----------------------+
Here we grouped each consecutive 2 seconds and collected the vars arrays into a new list.
Now, using a Window spec, you can collect the cumulative values and use flatten to flatten the sub-arrays:
w = Window.orderBy("timestamp").rowsBetween(Window.unboundedPreceding, Window.currentRow)
df = df.withColumn("vars", flatten(collect_list(col("vars")).over(w)))
df.show()
#+---------+------------------------------------------------------------------+
#|timestamp|vars |
#+---------+------------------------------------------------------------------+
#|2 |[[1, 1, 1], [1, 2, 1]] |
#|4 |[[1, 1, 1], [1, 2, 1], [1, 1, 1], [1, 3, 4]] |
#|6 |[[1, 1, 1], [1, 2, 1], [1, 1, 1], [1, 3, 4], [1, 1, 1], [1, 2, 3]]|
#+---------+------------------------------------------------------------------+
Finally, use the aggregate function with zip_with to sum the arrays:
t = "aggregate(vars, cast(array() as array<double>), (acc, a) -> zip_with(acc, a, (x, y) -> coalesce(x, 0) + coalesce(y, 0)))"
df.withColumn("vars", expr(t)).show(truncate=False)
#+---------+-----------------+
#|timestamp|vars |
#+---------+-----------------+
#|2 |[2.0, 3.0, 2.0] |
#|4 |[4.0, 7.0, 7.0] |
#|6 |[6.0, 10.0, 11.0]|
#+---------+-----------------+
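If you are on Spark 3.1+, here is a sketch of the same fold written with the Python higher-order functions (pyspark.sql.functions.aggregate and zip_with) instead of a SQL expression string; the behavior is assumed to be equivalent:
from pyspark.sql.functions import aggregate, zip_with, coalesce, lit, expr

# same fold as the SQL string above: start from an empty array<double> and
# element-wise add each collected sub-array, treating missing elements as 0
df = df.withColumn(
    "vars",
    aggregate(
        "vars",
        expr("cast(array() as array<double>)"),
        lambda acc, a: zip_with(acc, a, lambda x, y: coalesce(x, lit(0.0)) + coalesce(y, lit(0.0)))
    )
)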
Putting it all together:
from pyspark.sql.functions import ceil, col, collect_list, flatten, expr
from pyspark.sql import Window
w = Window.orderBy("timestamp").rowsBetween(Window.unboundedPreceding, Window.currentRow)
t = "aggregate(vars, cast(array() as array<double>), (acc, a) -> zip_with(acc, a, (x, y) -> coalesce(x, 0) + coalesce(y, 0)))"
nb_seconds = 2
df.groupBy((ceil(col("timestamp") / nb_seconds) * nb_seconds).alias("timestamp")) \
.agg(collect_list(col("vars")).alias("vars")) \
.withColumn("vars", flatten(collect_list(col("vars")).over(w))) \
.withColumn("vars", expr(t)).show(truncate=False)
Upvotes: 2