Lilly

Reputation: 988

pyspark dataframe inside a for loop

I have a situation as below: a master dataframe DF1, which I am updating inside a for loop. My pseudocode is as follows.

for Year in [2019, 2020]:
  query_west = f'query_{Year}'
  df_west = spark.sql(query_west)
  df_final = DF1.join(df_west, on='ID', how='left')

In this case df_final gets rebuilt from the join on every iteration, right? I want those changes to be reflected on my main dataframe DF1 in every iteration of the for loop.

Please let me know whether my logic is right. Thanks.

Upvotes: 1

Views: 973

Answers (1)

mck

Reputation: 42352

As the comment by @venky__ suggested, you need to add another line DF1 = df_final at the end of the for loop body, so that DF1 is updated on each iteration.
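
For example, keeping the names from your question (DF1 and the query_2019 / query_2020 queries), the corrected loop would look like this:

for Year in [2019, 2020]:
  query_west = f'query_{Year}'
  df_west = spark.sql(query_west)
  df_final = DF1.join(df_west, on='ID', how='left')
  DF1 = df_final  # carry the joined result into the next iteration

After the loop finishes, DF1 holds the result of joining against both years.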

Another way is to use reduce to combine the joins all at once. e.g.

from functools import reduce

# collect the master dataframe and each year's query result into a list
dfs = [DF1]
for Year in [2019, 2020]:
  query_west = f'query_{Year}'
  df_west = spark.sql(query_west)
  dfs.append(df_west)

# fold the list into a single dataframe by left-joining pairwise on ID
df_final = reduce(lambda x, y: x.join(y, 'ID', 'left'), dfs)

which is equivalent to

df_final = (DF1
  .join(spark.sql('query_2019'), 'ID', 'left')
  .join(spark.sql('query_2020'), 'ID', 'left'))

Upvotes: 1
