Reputation: 47
Suppose I have a lot of dataframes with similar structure but different columns. I want to combine all of them together; how can I do it in an easier way?
For example, df1, df2, and df3 are as follows:
df1
id  base1  base2  col1  col2  col3  col4
1   1      100    30    1     2     3
2   2      200    40    2     3     4
3   3      300    20    4     4     5
df2
id  base1  base2  col1
5   4      100    15
6   1      99     18
7   2      89     9
df3
id  base1  base2  col1  col2
9   2      77     12    3
10  1      89     16    5
11  2      88     10    7
and I want the combined result to be:
id  base1  base2  col1  col2  col3  col4
1   1      100    30    1     2     3
2   2      200    40    2     3     4
3   3      300    20    4     4     5
5   4      100    15    NaN   NaN   NaN
6   1      99     18    NaN   NaN   NaN
7   2      89     9     NaN   NaN   NaN
9   2      77     12    3     NaN   NaN
10  1      89     16    5     NaN   NaN
11  2      88     10    7     NaN   NaN
Currently I use this code:
from pyspark.sql.functions import lit

def customUnion(df1, df2):
    cols1 = df1.columns
    cols2 = df2.columns
    # full column list: df1's columns plus whatever df2 adds, sorted
    total_cols = sorted(cols1 + list(set(cols2) - set(cols1)))

    def expr(mycols, allcols):
        # keep a column if the frame has it, otherwise pad with a null literal
        def processCols(colname):
            if colname in mycols:
                return colname
            return lit(None).alias(colname)
        return [processCols(c) for c in allcols]

    return df1.select(expr(cols1, total_cols)).union(df2.select(expr(cols2, total_cols)))

df_comb1 = customUnion(df1, df2)
df_comb2 = customUnion(df_comb1, df3)
However, if I keep creating new dataframes like df4, df5, etc. (100+), my code becomes messy.
Is there a way to code this more simply?
Thanks in advance.
Upvotes: 2
Views: 173
Reputation: 45339
You can manage this with a list of data frames and a function, without needing to statically name each data frame:
dataframes = [df1, df2, df3]  # collect the data frames in a list
Compute the set of all possible columns:
all_cols = {i for lst in [df.columns for df in dataframes] for i in lst}
#{'base1', 'base2', 'col1', 'col2', 'col3', 'col4', 'id'}
A function to add missing columns to a DF:
import pyspark.sql.functions as f

def add_missing_cols(df, cols):
    v = df
    # append a null column for each name the frame is missing
    for col in [c for c in cols if c not in df.columns]:
        v = v.withColumn(col, f.lit(None))
    return v
completed_dfs = [add_missing_cols(df, all_cols) for df in dataframes]
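One caveat: all_cols is a Python set, so add_missing_cols appends the missing columns in no guaranteed order, while unionAll matches columns by position rather than by name. A defensive variant (a sketch reusing the names above) fixes the column order with a select before unioning:

ordered = sorted(all_cols)
completed_dfs = [add_missing_cols(df, all_cols).select(*ordered) for df in dataframes]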
# fold the padded frames together pairwise
res = completed_dfs[0]
for df in completed_dfs[1:]:
    res = res.unionAll(df)
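Equivalently, the pairwise fold can be written with functools.reduce (same result as the loop above):

from functools import reduce

res = reduce(lambda a, b: a.unionAll(b), completed_dfs)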
res.show()
+---+-----+-----+----+----+----+----+
| id|base1|base2|col1|col2|col3|col4|
+---+-----+-----+----+----+----+----+
| 1| 1| 100| 30| 1| 2| 3|
| 2| 2| 200| 40| 2| 3| 4|
| 3| 3| 300| 20| 4| 4| 5|
| 5| 4| 100| 15|null|null|null|
| 6| 1| 99| 18|null|null|null|
| 7| 2| 89| 9|null|null|null|
| 9| 2| 77| 12| 3|null|null|
| 10| 1| 89| 16| 5|null|null|
| 11| 2| 88| 10| 7|null|null|
+---+-----+-----+----+----+----+----+
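As an aside, if you are on Spark 3.1+, unionByName with allowMissingColumns=True does the null-padding for you, so the whole job collapses to a single fold (a sketch assuming the same dataframes list as above):

from functools import reduce

# unionByName matches columns by name and fills missing ones with null
res = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), dataframes)
res.show()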
Upvotes: 1