user3292373

Reputation: 533

Pyspark column generation on lookup of previous rows and compute

I need to generate a column dynamically by looking up values from the previous row and computing on them.

The code I have tried so far is shared below.

The Spark DataFrame is below:

cat a    b    c
1   null 0    0
1   0    9    0 
2   0    126  0
2   7    0    0
2   36   0    0
3   35   0    0

The output should be as below:

cat a    b    c    d
1   null 0    0    null
1   0    9    0    0
2   0    126  0    9
2   7    0    0    119
2   36   0    0    83 
2   35   0    0    48

Here column d is generated dynamically. Below is the code I used, which is not working:

from pyspark.sql.functions import lag
from pyspark.sql.window import Window

w = Window.partitionBy().orderBy('cat')
df = df.withColumn('d', lag("b").over(w) - df.a + df.c)
df.show()

The problem is that the formula is not being applied properly; the formula is prev(b) - a + c.

Upvotes: 1

Views: 2173

Answers (1)

Prem

Reputation: 11955

import pyspark.sql.functions as f
from pyspark.sql.window import Window

df = sc.parallelize([
    [1,None, 0, 0],
    [1,0,9,0],
    [2,0,126,0],
    [2,7,0,0],
    [2,36,0,0],
    [2,35,0,0]
]).toDF(('cat','a', 'b', 'c'))

# row_id preserves the original row order so the window has something to order by
df1 = df.withColumn("row_id", f.monotonically_increasing_id())
w = Window.partitionBy().orderBy(f.col("row_id"))
# d = previous row's b - current a + current c
df1 = df1.withColumn("d", f.lag("b").over(w) - f.col("a") + f.col("c")).drop("row_id")
df1.show()

Output is:

+---+----+---+---+----+
|cat|   a|  b|  c|   d|
+---+----+---+---+----+
|  1|null|  0|  0|null|
|  1|   0|  9|  0|   0|
|  2|   0|126|  0|   9|
|  2|   7|  0|  0| 119|
|  2|  36|  0|  0| -36|
|  2|  35|  0|  0| -35|
+---+----+---+---+----+
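
Note that the -36 and -35 in the last two rows differ from the 83 and 48 in the expected output. Those expected values appear to carry the last non-zero b forward and keep subtracting a, and a window expression cannot reference the column it is itself defining, so lag("b") alone cannot produce them. A minimal sketch of one way to get those numbers, assuming the intended rule is "last non-zero b, minus the running sum of a accumulated since that b, plus c" (the helper columns prev_b, grp, base_b and run_a are illustrative, not from the original post):

import pyspark.sql.functions as f
from pyspark.sql.window import Window

# Reuses the df built above; row_id again fixes the row order for the windows.
df2 = df.withColumn("row_id", f.monotonically_increasing_id())
w = Window.orderBy("row_id")

# prev_b is the previous row's b (the asker's prev(b)).
df2 = df2.withColumn("prev_b", f.lag("b").over(w))

# A new group starts on every row whose previous b was non-zero,
# i.e. wherever the running value should reset to that b.
df2 = df2.withColumn("grp", f.sum(f.when(f.col("prev_b") != 0, 1).otherwise(0)).over(w))
wg = Window.partitionBy("grp").orderBy("row_id")

# base_b: the non-zero b that opened the group (0 before any non-zero b has appeared).
# run_a: running sum of a within the group.
df2 = (df2
       .withColumn("base_b", f.coalesce(f.first("prev_b").over(wg), f.lit(0)))
       .withColumn("run_a", f.sum("a").over(wg))
       .withColumn("d", f.col("base_b") - f.col("run_a") + f.col("c"))
       .drop("row_id", "prev_b", "grp", "base_b", "run_a"))
df2.show()

On the sample data this yields null, 0, 9, 119, 83, 48 for d, matching the expected output above.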

Upvotes: 3
