Create dynamic names for PySpark dataframes inside a for loop

Question

I have a main dataframe, df_PROD. For a certain range of years, I want to filter the matching records from the main df and, if a year has more than 0 records, put them into a separate df (i.e. df_PROD_year) and append that year to a list for later use.

I am trying to create dynamic dataframe names inside a for loop, as below. If a year has more than 0 records, I assign them to a separate df for that year and append the year to a list:

PROD_years_list = []
year=int(datetime.datetime.today().year)
for i in range (year, 2016, -1 ):
  print(i)
  df_PROD_{i} = df_PROD.filter(col("Year") == i)
  if df_PROD_{i}.count() > 0:
    PROD_years_list.append(i)
print(PROD_years_list)

But I get an invalid syntax error on this line:

df_PROD_{i} = df_PROD.filter(col("Year") == i)

How can I dynamically name a dataframe inside a for loop? Thanks.

blackbishop · Accepted Answer

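A dict is probably a better fit for what you need. A variable name cannot be built at runtime with {i}, which is why that assignment is a syntax error. Instead, store each dataframe under its year as the key; the dict's keys then double as the list of years that had records: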

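import datetime
from pyspark.sql.functions import col

PROD_years = {}  # maps year -> filtered dataframe
year = datetime.datetime.today().year

for i in range(year, 2016, -1):
  df = df_PROD.filter(col("Year") == i)
  # keep only the years that actually have records
  if df.count() > 0:
    PROD_years[i] = df

print(PROD_years)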
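
To try the dict pattern without a Spark session, here is a minimal sketch that uses plain Python lists as stand-ins for dataframes. The names rows, build_year_map and prod_years_list are hypothetical and not part of the original code. With PySpark, the list comprehension would be the df_PROD.filter(col("Year") == i) call and the emptiness check would be df.count() > 0, as above.

```python
def build_year_map(rows, first_year, last_year):
    """Map each year that has at least one row to its rows, newest year first."""
    by_year = {}
    for year in range(last_year, first_year - 1, -1):
        # Stand-in for df_PROD.filter(col("Year") == year)
        subset = [r for r in rows if r["Year"] == year]
        # Stand-in for df.count() > 0
        if subset:
            by_year[year] = subset
    return by_year


# Hypothetical sample data: no rows for 2018 or 2020
rows = [
    {"Year": 2017, "Value": 1},
    {"Year": 2019, "Value": 2},
    {"Year": 2019, "Value": 3},
]

prod_years = build_year_map(rows, 2017, 2020)

# Dicts preserve insertion order, so the keys are the non-empty years
# in the order the loop visited them.
prod_years_list = list(prod_years)
print(prod_years_list)  # [2019, 2017]
```

A single dataframe is then looked up by its year, for example prod_years[2019], and the separate list of years from the question is no longer needed because it can always be derived from the dict's keys.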
