Nacho

Reputation: 832

Efficiently merging a large number of pyspark DataFrames

I'm trying to perform a DataFrame union of thousands of DataFrames stored in a Python list. I'm using two approaches I found: the first does the union in a for loop, and the second uses functools.reduce. Both of them work well for toy examples, but for thousands of DataFrames I'm experiencing severe overhead, probably caused by code outside the JVM sequentially appending one DataFrame at a time (with both merging approaches).

from functools import reduce  # For Python 3.x
from pyspark.sql import DataFrame

# The reduce approach
def unionAll(dfs):
    return reduce(DataFrame.unionAll, dfs)

df_list = [td2, td3, td4, td5, td6, td7, td8, td9, td10]
df = unionAll(df_list)

# The loop approach
df = df_list[0].union(df_list[1])
for d in df_list[2:]:
    df = df.union(d)

The question is how to perform this multi-DataFrame operation efficiently, probably by circumventing the overhead caused by merging the DataFrames one by one.

Thank you very much

Upvotes: 2

Views: 2715

Answers (1)

cs95

Reputation: 403120

You are currently unioning your DataFrames like this:

(((td1 + td2) + td3) + td4)

At each stage, you are concatenating a huge DataFrame with a small one, resulting in a copy at each step and a lot of wasted memory. I would suggest combining them like this instead:

(td1 + td2) + (td3 + td4)

The idea is to iteratively coalesce pairs of roughly the same size until you are left with a single result. Here is a prototype:

def pairwise_reduce(op, x):
    # Repeatedly merge adjacent pairs until a single element remains.
    while len(x) > 1:
        # Combine elements pairwise: (x[0], x[1]), (x[2], x[3]), ...
        v = [op(i, j) for i, j in zip(x[::2], x[1::2])]
        # If the count is odd, fold the leftover element into the last result.
        if len(x) % 2 == 1:
            v[-1] = op(v[-1], x[-1])
        x = v
    return x[0]

result = pairwise_reduce(DataFrame.unionAll, df_list)
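
For completeness, here is a minimal, self-contained sketch of how this wires up end to end. The local SparkSession and the toy single-column DataFrames are made up for illustration, and DataFrame.union is the Spark 2.0+ name for unionAll:

from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Hypothetical stand-ins for td2, td3, ...: many small DataFrames
# sharing the same schema.
df_list = [spark.createDataFrame([(i,)], ["value"]) for i in range(1000)]

# The balanced pairwise union builds a plan tree of depth O(log n)
# instead of a chain of depth O(n), which keeps driver-side plan
# analysis from blowing up.
result = pairwise_reduce(DataFrame.union, df_list)
result.count()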

You can see what a huge difference this makes for plain Python lists:

from functools import reduce 
from operator import add

x = [[1, 2, 3], [4, 5, 6], [7, 8], [9, 10, 11, 12]] * 1000

%timeit sum(x, [])
%timeit reduce(add, x)
%timeit pairwise_reduce(add, x)

64.2 ms ± 606 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
66.3 ms ± 679 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
970 µs ± 9.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

sum(x, []) == reduce(add, x) == pairwise_reduce(add, x)
# True
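
To see why, note that the left fold (sum or reduce) re-copies the ever-growing accumulator on every step, so concatenating n lists of average length m moves on the order of n² · m elements, while the pairwise scheme copies each element only about log2(n) times. A rough, hypothetical copy counter (not from the original answer) makes the gap concrete for the 4000 lists above:

def copies_left_fold(n, m):
    # Each step copies the whole accumulator plus one new list of length m.
    total, acc = 0, m
    for _ in range(n - 1):
        total += acc + m
        acc += m
    return total

def copies_pairwise(n, m):
    # Each round roughly halves the list count and copies all
    # n * m elements once; there are about log2(n) rounds.
    total, k = 0, n
    while k > 1:
        total += n * m
        k = (k + 1) // 2
    return total

copies_left_fold(4000, 3)   # 24005997 element copies
copies_pairwise(4000, 3)    # 144000 element copies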

Upvotes: 8
