user3292373

Reputation: 533

Pyspark column generation on lookup of previous rows and compute

I need to generate a column dynamically by looking up values from the previous row and computing on them.

The code I have tried so far is shared below.

The Spark DataFrame is below:

cat a    b    c
1   null 0    0
1   0    9    0 
2   0    126  0
2   7    0    0
2   36   0    0
3   35   0    0

The output should be as below:

cat a    b    c    d
1   null 0    0    null
1   0    9    0    0
2   0    126  0    9
2   7    0    0    119
2   36   0    0    83 
2   35   0    0    48

Here column d is generated dynamically. Below is the code I used, which is not working:

from pyspark.sql.functions import lag
from pyspark.sql.window import Window

w = Window.partitionBy().orderBy('cat')
df = df.withColumn('d', lag("b").over(w) - df.a + df.c)
df.show()

The problem is that the formula is not being applied properly; the formula is prev(b) - a + c.

Upvotes: 1

Views: 2173

Answers (1)

Prem

Reputation: 11955

import pyspark.sql.functions as f
from pyspark.sql.window import Window

df = sc.parallelize([
    [1,None, 0, 0],
    [1,0,9,0],
    [2,0,126,0],
    [2,7,0,0],
    [2,36,0,0],
    [2,35,0,0]
]).toDF(('cat','a', 'b', 'c'))

# row_id preserves the original row order so the window has something to order by
df1 = df.withColumn("row_id", f.monotonically_increasing_id())
w = Window.partitionBy().orderBy(f.col("row_id"))
# d = previous row's b - current a + current c
df1 = df1.withColumn("d", f.lag("b").over(w) - f.col("a") + f.col("c")).drop("row_id")
df1.show()

Output is:

+---+----+---+---+----+
|cat|   a|  b|  c|   d|
+---+----+---+---+----+
|  1|null|  0|  0|null|
|  1|   0|  9|  0|   0|
|  2|   0|126|  0|   9|
|  2|   7|  0|  0| 119|
|  2|  36|  0|  0| -36|
|  2|  35|  0|  0| -35|
+---+----+---+---+----+
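
Note that the -36 and -35 in the last two rows differ from the 83 and 48 in the expected output. Those expected values appear to carry the last non-zero b forward and keep subtracting a, and a window expression cannot reference the column it is itself defining, so lag("b") alone cannot produce them. A minimal sketch of one way to get those numbers, assuming the intended rule is "last non-zero b, minus the running sum of a accumulated since that b, plus c" (the helper columns prev_b, grp, base_b and run_a are illustrative, not from the original post):

import pyspark.sql.functions as f
from pyspark.sql.window import Window

# Reuses the df built above; row_id again fixes the row order for the windows.
df2 = df.withColumn("row_id", f.monotonically_increasing_id())
w = Window.orderBy("row_id")

# prev_b is the previous row's b (the asker's prev(b)).
df2 = df2.withColumn("prev_b", f.lag("b").over(w))

# A new group starts on every row whose previous b was non-zero,
# i.e. wherever the running value should reset to that b.
df2 = df2.withColumn("grp", f.sum(f.when(f.col("prev_b") != 0, 1).otherwise(0)).over(w))
wg = Window.partitionBy("grp").orderBy("row_id")

# base_b: the non-zero b that opened the group (0 before any non-zero b has appeared).
# run_a: running sum of a within the group.
df2 = (df2
       .withColumn("base_b", f.coalesce(f.first("prev_b").over(wg), f.lit(0)))
       .withColumn("run_a", f.sum("a").over(wg))
       .withColumn("d", f.col("base_b") - f.col("run_a") + f.col("c"))
       .drop("row_id", "prev_b", "grp", "base_b", "run_a"))
df2.show()

On the sample data this yields null, 0, 9, 119, 83, 48 for d, matching the expected output above.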

Upvotes: 3
