PySpark "variable not defined" error when using a window function in a DataFrame select operation

Question

I have a sample input DataFrame as shown below. The number of value columns (m1, m2, and so on) is not fixed; there can be any number of them.

customer_id|field_id|month_id|m1  |m2
1001       |  10    |01      |10  |20    
1002       |  20    |01      |20  |30    
1003       |  30    |01      |30  |40
1001       |  10    |02      |40  |50    
1002       |  20    |02      |50  |60    
1003       |  30    |02      |60  |70
1001       |  10    |03      |70  |80    
1002       |  20    |03      |80  |90    
1003       |  30    |03      |90  |100

I need to create new columns holding the cumulative sums of m1 and m2, and I used a window function to achieve that. However, I ran into a strange problem, shown below:

Code Tried:

from pyspark.sql import Window
from pyspark.sql import functions as F

partition_list = ["customer_id", "field_id"]
# Window covering every row from the start of the partition up to the current month
window_num = (Window.partitionBy(partition_list).orderBy("month_id").rangeBetween(Window.unboundedPreceding, 0))
# Expressions for the new columns, kept as strings
n1_list_expr = ["F.sum(F.col('m1')).over(window_num).alias('n1')", "F.sum(F.col('m2')).over(window_num).alias('n2')"]
# Evaluate each string with eval to get a Column object (this is the line that fails)
new_n1_list_expr = [eval(x) for x in n1_list_expr]
# Column list of the source DataFrame
df_col = df.columns
# Add the new column expressions (extend, not append, so the list stays flat)
df_col.extend(new_n1_list_expr)
# Select the existing columns plus the newly calculated ones
df = df.select(df_col)

But the program fails at the eval statement with the error below:

File "<string>", line 1, in <module>
NameError: name 'window_num' is not defined

Strangely, the code works when I run it on its own, but when I put it inside a function in a common module, this block fails with the error above. Why is eval unable to find the window_num variable?
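
A likely explanation for the "works on its own, fails inside a function" behaviour is Python scoping rather than anything in Spark. On CPython versions before 3.12, a list comprehension runs in its own scope, and eval() without an explicit namespace only sees that scope plus the module globals. At module level window_num is a global, so it is found; inside a function it is a local of the enclosing function, so it is not. The sketch below reproduces this with a plain string standing in for the Window object (the function names are made up for illustration), and shows one workaround: passing eval() the names it needs. Newer Python versions may not raise the error in the first function.

```python
def eval_in_comprehension():
    window_num = "window spec"  # stand-in for the real Window object
    exprs = ["window_num.upper()"]
    try:
        # No explicit namespace: eval() looks in the comprehension's scope
        # and the module globals, not reliably in this function's locals.
        return [eval(x) for x in exprs]
    except NameError as exc:
        return f"NameError: {exc}"


def eval_with_namespace():
    window_num = "window spec"  # stand-in for the real Window object
    exprs = ["window_num.upper()"]
    # Hand eval() the names it needs explicitly.
    namespace = {"window_num": window_num}
    return [eval(x, namespace) for x in exprs]


print(eval_in_comprehension())  # result depends on the Python version
print(eval_with_namespace())    # ['WINDOW SPEC']
```

In the code from the question, the equivalent change would be eval(x, {"F": F, "window_num": window_num}), assuming F is the pyspark.sql.functions alias.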

Expected Output:

customer_id|field_id|month_id|m1     |m2    |n1   |n2  
1001       |  10    |01      |10     |20    |10   |20  
1002       |  20    |01      |20     |30    |20   |30  
1003       |  30    |01      |30     |40    |30   |40  
1001       |  10    |02      |40     |50    |50   |70  
1002       |  20    |02      |50     |60    |70   |90
1003       |  30    |02      |60     |70    |90   |110  
1001       |  10    |03      |70     |80    |120  |150
1002       |  20    |03      |80     |90    |150  |180
1003       |  30    |03      |90     |100   |180  |210
