Reputation: 131
I would like to add a column to my Spark DataFrame that is computed from two existing columns.
I want to create the column Areas,
which is calculated from the current and previous rows with the formula:
( (Pct_Buenos_Acum[i]-Pct_Buenos_Acum[i-1]) * (Pct_Malos_Acum[i]+Pct_Malos_Acum[i-1]) ) / 2
I have tried this:
w = Window.rowsBetween(Window.unboundedPreceding, Window.currentRow)
df= df.withColumn('Areas', (( ( col('Pct_Acum_buenos')-col('Pct_Acum_buenos' ) )*(col('Pct_Acum_malos')+col('Pct_Acum_malos')))/2).over(w))
Attached is a screenshot of what I have so far.
Upvotes: 0
Views: 755
Reputation: 1078
You can access the previous row's value in PySpark with the lag window function. Going by that:
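from pyspark.sql import functions as F
from pyspark.sql.window import Window
# add an index column to use in orderBy; the IDs are unique and increasing, but not necessarily consecutive
df = df.withColumn('index', F.monotonically_increasing_id())
# no partition columns: the window spans the whole DataFrame, ordered by index
w = Window.partitionBy().orderBy('index')
# F.lag(...) returns the previous row's value (null on the first row, so Areas is null there)
df = df.withColumn('Areas', ((F.col('Pct_Acum_buenos') - F.lag(F.col('Pct_Acum_buenos')).over(w)) * (F.col('Pct_Acum_malos') + F.lag(F.col('Pct_Acum_malos')).over(w))) / 2)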
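To sanity-check the Spark output on a few rows, here is a minimal plain-Python sketch of the same formula. The function name trapezoid_areas and the sample values are made up for illustration and are not taken from the question's data. The first row gets None because it has no previous row, which mirrors lag returning null there.

```python
def trapezoid_areas(buenos, malos):
    """Per-row area: (b[i] - b[i-1]) * (m[i] + m[i-1]) / 2, with None for the first row."""
    areas = []
    for i in range(len(buenos)):
        if i == 0:
            # No previous row to compare against (lag would be null in Spark).
            areas.append(None)
            continue
        diff_buenos = buenos[i] - buenos[i - 1]
        sum_malos = malos[i] + malos[i - 1]
        areas.append(diff_buenos * sum_malos / 2)
    return areas


# Hypothetical cumulative percentages, used only to illustrate the calculation.
buenos = [0.0, 0.5, 1.0]
malos = [0.0, 0.25, 1.0]
areas = trapezoid_areas(buenos, malos)
print(areas)  # [None, 0.0625, 0.3125]
```

If the per-row values in the Areas column match a hand calculation like this one, the window ordering is doing what you expect.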
Upvotes: 1