Reputation: 31
I'm working with Apache Spark 2.3.0 (cloudera4) and I have an issue processing a DataFrame.
I've got this input dataframe:
+---+---+----+
| id| d1|  d2|
+---+---+----+
|  1|   | 2.0|
|  2|   |-4.0|
|  3|   | 6.0|
|  4|3.0|    |
+---+---+----+
And I need this output:
+---+---+----+----+
| id| d1|  d2|   r|
+---+---+----+----+
|  1|   | 2.0| 7.0|
|  2|   |-4.0| 5.0|
|  3|   | 6.0| 9.0|
|  4|3.0|    | 3.0|
+---+---+----+----+
In other words, iterating from the row with the largest id (4): put that row's d1 value in the r column, then for the next row down (3) put r[4] + d2[3] in r, and so on (see the plain Scala sketch below).
Is it possible to do something like that in Spark? I need the value computed for one row in order to calculate the value of another row.
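To make the iteration concrete, here is the same logic in plain (non-Spark) Scala; the Rec class and the hard-coded values are just a mock-up of the table above:
case class Rec(id: Int, d1: Option[Double], d2: Option[Double])

val rows = Seq(
  Rec(1, None, Some(2.0)), Rec(2, None, Some(-4.0)),
  Rec(3, None, Some(6.0)), Rec(4, Some(3.0), None))

// Walk from the largest id down, carrying the running total:
// r(4) = d1(4), then r(i) = r(i + 1) + d2(i).
val r = rows.sortBy(-_.id)
  .scanLeft(0.0)((acc, rec) => acc + rec.d1.orElse(rec.d2).getOrElse(0.0))
  .tail
// r == List(3.0, 9.0, 5.0, 7.0), i.e. the r values for ids 4, 3, 2, 1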
Upvotes: 1
Views: 784
Reputation: 12522
How about this? The important bit is sum($"r1").over(Window.orderBy($"id".desc)),
which calculates a cumulative sum of a column. Other than that, I'm creating a couple of helper columns to get the max id and to get the ordering right.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{max, sum, when}
import spark.implicits._  // for the $"..." column syntax; assumes a SparkSession named spark

val result = df
  // helper: the max id, broadcast to every row via an unbounded window
  .withColumn("max_id", max($"id").over(Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)))
  // helper: take d1 on the max-id row, d2 everywhere else
  .withColumn("r1", when($"id" === $"max_id", $"d1").otherwise($"d2"))
  // cumulative sum of r1, walking from the largest id downwards
  .withColumn("r", sum($"r1").over(Window.orderBy($"id".desc)))
  .drop($"max_id").drop($"r1")
  .orderBy($"id")
result.show
+---+----+----+---+
| id|  d1|  d2|  r|
+---+----+----+---+
|  1|null| 2.0|7.0|
|  2|null|-4.0|5.0|
|  3|null| 6.0|9.0|
|  4| 3.0|null|3.0|
+---+----+----+---+
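As a side note: since d1 is only populated on the max-id row in the sample data, the two helper columns could arguably be collapsed into a single coalesce. A minimal sketch, assuming that property holds for your real data:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, sum}

// Take d2 where present, fall back to d1 (assumes d1 is non-null
// only on the max-id row), then run the same descending cumulative sum.
val result2 = df
  .withColumn("r", sum(coalesce($"d2", $"d1")).over(Window.orderBy($"id".desc)))
  .orderBy($"id")
result2.show
Either way, note that a window with an orderBy but no partitionBy moves all rows to a single partition, which is fine at this size but can become a bottleneck on large data.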
Upvotes: 1