Reputation: 9
I would like to use PySpark to do the following: I have a dataframe "cost_df", and I need to first get the lag of 1 of the column called "cost", and then calculate the rolling sum (right-aligned) over a window of size 4 for the same column. I appreciate your help very much.
I tried:
cond = Window.partitionBy("year")
cond1 = Window.partitionBy("year").rowsBetween(-3, 0)
df_cost = df_cost.withColumn("cost_calc", sum(lag(col("cost"), 1).over(cond)).over(cond1))
This does not seem to be correct. Please help.
Upvotes: 0
Views: 37
Reputation: 9
Thank you, and I am sorry for the incompleteness. Yes, indeed I tried:
cond = Window.partitionBy("year")
cond1 = Window.partitionBy("year").orderBy("quarter").rowsBetween(-3, 0)
df_cost = df_cost.withColumn("cost_calc", sum(lag(col("cost"), 1).over(cond)).over(cond1))
but the results seem to be wrong.
Upvotes: 0
Reputation: 382
The operations you're mentioning depend on the order of your rows (since you want to get the 'previous' one). Therefore, you have to add orderBy to your window definitions. Maybe this is what you intended with your partitionBy?
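For example, something like the following should work. This is a minimal sketch, assuming "quarter" is the column that orders your rows within each "year" (as your second attempt suggests); computing the lag into its own column first also avoids nesting one window expression inside another:

from pyspark.sql import Window
from pyspark.sql import functions as F

# Ordering window for the lag, and a right-aligned rolling frame
# covering the current row plus the three preceding rows (size 4).
w = Window.partitionBy("year").orderBy("quarter")
w_roll = Window.partitionBy("year").orderBy("quarter").rowsBetween(-3, 0)

# Step 1: shift cost down by one row within each year.
df_cost = df_cost.withColumn("cost_lag", F.lag(F.col("cost"), 1).over(w))

# Step 2: rolling sum of the lagged column over the size-4 frame.
df_cost = df_cost.withColumn("cost_calc", F.sum(F.col("cost_lag")).over(w_roll))

Note that lag produces null for the first row of each year, and sum simply ignores nulls within the frame, so the first few values per year will be partial sums.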
Upvotes: 0