Reputation: 533
I need to generate a column dynamically, looking up values from the previous row. The code I have tried so far is shared below.
The Spark DataFrame is:
cat   a     b    c
1     null  0    0
1     0     9    0
2     0     126  0
2     7     0    0
2     36    0    0
3     35    0    0
The output should be as below:
cat   a     b    c    d
1     null  0    0    null
1     0     9    0    0
2     0     126  0    9
2     7     0    0    119
2     36    0    0    83
2     35    0    0    48
Here column d is generated dynamically. Below is the code I used, which is not working:
from pyspark.sql.functions import lag
from pyspark.sql.window import Window

w = Window.partitionBy().orderBy('cat')
df = df.withColumn('d', lag("b").over(w) - df.a + df.c)
df.show()
The problem is that the formula is not being replicated properly; it should be prev(b) - a + c.
Upvotes: 1
Views: 2173
Reputation: 11955
import pyspark.sql.functions as f
from pyspark.sql.window import Window

df = sc.parallelize([
    [1, None, 0, 0],
    [1, 0, 9, 0],
    [2, 0, 126, 0],
    [2, 7, 0, 0],
    [2, 36, 0, 0],
    [2, 35, 0, 0]
]).toDF(('cat', 'a', 'b', 'c'))

# Capture the original row order; ordering the window by 'cat' alone does not
# give a well-defined "previous row" within a category, because of ties.
df1 = df.withColumn("row_id", f.monotonically_increasing_id())
w = Window.partitionBy().orderBy(f.col("row_id"))

# d = previous row's b - a + c, evaluated over the original row order
df1 = df1.withColumn("d", f.lag("b").over(w) - f.col("a") + f.col("c")).drop("row_id")
df1.show()
Output is:
+---+----+---+---+----+
|cat| a| b| c| d|
+---+----+---+---+----+
| 1|null| 0| 0|null|
| 1| 0| 9| 0| 0|
| 2| 0|126| 0| 9|
| 2| 7| 0| 0| 119|
| 2| 36| 0| 0| -36|
| 2| 35| 0| 0| -35|
+---+----+---+---+----+
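For reference, the same lag-based formula can also be written in Spark SQL. This is a minimal sketch, not part of the original answer: it assumes a SparkSession is available as spark, and the temp-view name "t" is purely illustrative.

import pyspark.sql.functions as f

# Register the frame (with the ordering key) as a temp view; the view name "t"
# and the `spark` session variable are assumptions for this sketch.
df.withColumn("row_id", f.monotonically_increasing_id()) \
  .createOrReplaceTempView("t")

spark.sql("""
    SELECT cat, a, b, c,
           lag(b) OVER (ORDER BY row_id) - a + c AS d
    FROM t
""").show()

This produces the same d column as the DataFrame version above, since both apply lag(b) over the original row order.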
Upvotes: 3