Reputation: 988
I have the following situation. I have a master dataframe DF1. Inside a for-loop I join in new data each iteration to reflect the changes; my pseudocode is below.
for Year in [2019, 2020]:
    query_west = f'query_{Year}'
    df_west = spark.sql(query_west)
    df_final = DF1.join(df_west, on='ID', how='left')
In this case df_final gets joined with the query result and is recreated on every iteration, right? I want those changes to be reflected on my main dataframe DF1 in every iteration of the for loop.
Please let me know whether my logic is right. Thanks.
Upvotes: 1
Views: 973
Reputation: 42352
As the comment by @venky__ suggested, you need to add another line DF1 = df_final
at the end of the for loop, in order to make sure DF1 is updated in each iteration.
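For illustration, a minimal sketch of that approach, keeping your query_{Year} placeholder as a stand-in for the actual SQL text:

for Year in [2019, 2020]:
    query_west = f'query_{Year}'                        # placeholder for the real SQL
    df_west = spark.sql(query_west)
    df_final = DF1.join(df_west, on='ID', how='left')
    DF1 = df_final                                      # carry the result into the next iteration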
Another way is to use reduce to combine all the joins at once, e.g.
from functools import reduce

dfs = [DF1]
for Year in [2019, 2020]:
    query_west = f'query_{Year}'
    df_west = spark.sql(query_west)
    dfs.append(df_west)

df_final = reduce(lambda x, y: x.join(y, 'ID', 'left'), dfs)
which is equivalent to
df_final = DF1.join(spark.sql('query_2019'), 'ID', 'left').join(spark.sql('query_2020'), 'ID', 'left')
Upvotes: 1