Reputation: 181
I am struggling with a (Py)Spark problem.
I have a column "col" in an ordered dataframe and need a running sum of its values that restarts after every 0. What I need is the column "sum_from_0" below. I tried it with window functions but did not succeed. Any idea on how to solve this task would be appreciated. Thank you in advance.
col  sum_from_0
0    None
0    None
1    1
2    3
1    4
4    8
3    11
0    None
0    None
0    None
1    1
2    3
3    6
3    9
2    11
0    None
0    None
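For reference, the example column can be reproduced in PySpark roughly like this (a minimal sketch; the real dataframe and how its ordering is established are assumptions, and createDataFrame only preserves row order reliably for small local data):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The example column from above; row order matters for the running sum.
data = [0, 0, 1, 2, 1, 4, 3, 0, 0, 0, 1, 2, 3, 3, 2, 0, 0]
df = spark.createDataFrame([(v,) for v in data], ["col"])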
Upvotes: 0
Views: 353
Reputation: 13591
There is no ordering column, so I create one first and add some temporary columns to separate the sum groups. After that, I sum "col" over a window that partitions by group and orders by id, such as:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w1 = Window.orderBy("id")                             // global running window
val w2 = Window.partitionBy("group").orderBy("id")        // running window within each group

df.withColumn("id", monotonically_increasing_id)         // add an ordering column
  .withColumn("zero", (col("col") === 0).cast("int"))    // flag rows where col is 0
  .withColumn("group", sum("zero").over(w1))             // running count of zeros = group id
  .withColumn("sum_from_0", sum("col").over(w2))         // cumulative sum within each group
  .orderBy("id")
  .drop("id", "group", "zero")
  .show(20, false)
which gives the results:
+---+----------+
|col|sum_from_0|
+---+----------+
|0 |0 |
|0 |0 |
|1 |1 |
|2 |3 |
|1 |4 |
|4 |8 |
|3 |11 |
|0 |0 |
|0 |0 |
|0 |0 |
|1 |1 |
|2 |3 |
|3 |6 |
|3 |9 |
|2 |11 |
|0 |0 |
|0 |0 |
+---+----------+
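Since the question asks about PySpark, the same approach translates directly (a sketch assuming the dataframe is called df and the column is named col):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w1 = Window.orderBy("id")
w2 = Window.partitionBy("group").orderBy("id")

result = (df.withColumn("id", F.monotonically_increasing_id())    # ordering column
            .withColumn("zero", (F.col("col") == 0).cast("int"))  # flag zeros
            .withColumn("group", F.sum("zero").over(w1))          # group id = running zero count
            .withColumn("sum_from_0", F.sum("col").over(w2))      # running sum per group
            .orderBy("id")
            .drop("id", "group", "zero"))
result.show(20, truncate=False)
If the zero rows should show None instead of 0, as in the question's expected output, they can be masked afterwards:
result = result.withColumn(
    "sum_from_0",
    F.when(F.col("col") == 0, F.lit(None)).otherwise(F.col("sum_from_0")),
)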
Upvotes: 3