Heus

Reputation: 9

PySpark rolling sum (right-aligned)

I would like to use PySpark to do the following: I have a dataframe "df_cost" and I need to first take the lag of 1 of the column called "cost", and then calculate the rolling sum (right-aligned) over a window of size 4 for that same column. I appreciate your help very much.

I tried:

cond = Window.partitionBy("year")
cond1 = Window.partitionBy("year").rowsBetween(-3, 0)

df_cost = df_cost.withColumn("cost_calc", sum(lag(col("cost"), 1).over(cond)).over(cond1))

This does not seem to be correct. Please help.

Upvotes: 0

Views: 37

Answers (2)

Heus

Reputation: 9

Thank you, and I am sorry for the incompleteness. Yes, indeed I tried:

cond = Window.partitionBy("year")
cond1 = Window.partitionBy("year").orderBy("quarter").rowsBetween(-3, 0)

df_cost = df_cost.withColumn("cost_calc", sum(lag(col("cost"), 1).over(cond)).over(cond1))

but the results seem to be wrong.

Upvotes: 0

KazM

Reputation: 382

The operations you're describing depend on the order of your rows (since you want to get the "previous" one). Therefore, you have to add an orderBy to your window definitions. Perhaps this is what you intended with your partitionBy?

Upvotes: 0
