Prachi
Prachi

Reputation: 544

How to merge multiple pandas dataframes into one original dataframe in the most efficient way?

How to merge 4 pandas dataframes into one original dataframe in the most efficient way? Below shows the original dataframe df whose 4 columns CC1, CC2, CC3 and CC4 need to be updated with the respective columns from df1, df2, df3 and df4.

The common columns for all dataframes are TD and PD.

EDIT

df

TD   PD  CC1    CC2    CC3    CC4    A     B     C
10   1   Null   Null   Null   Null   1     1     0
10   2   Null   Null   Null   Null   0     1     1
10   3   Null   Null   Null   Null   Null  2     Null
20   1   Null   Null   Null   Null   Null  0     1
20   4   Null   Null   Null   Null   2     Null  Null
30   1   Null   Null   Null   Null   4     0     2
30   3   Null   Null   Null   Null   8     Null  5
30   5   Null   Null   Null   Null   Null  1     1
40   2   Null   Null   Null   Null   0     0     0

df1

TD   PD   CC1
10   2     0
20   1     5
20   4     2
30   3    10

df2

TD   PD   CC2
10   1     15
10   2     10
20   4     20

df3

TD   PD   CC3
10   3     0
20   4     5
30   1     9

df4

TD   PD   CC4
20   4     0
30   1     15
30   3     20

Expected output after merge is as shown below:

df

TD   PD  CC1    CC2    CC3    CC4   A      B     C
10   1   Null   15    Null   Null   1      1     0
10   2    0     10    Null   Null   0      1     1
10   3   Null  Null    0     Null   Null   2     Null
20   1    5    Null   Null   Null   Null   0     1
20   4    2     20     5      0     2      Null  Null
30   1   Null  Null    9      15    4      0     2
30   3    10   Null   Null    20    8      Null  5
30   5   Null   Null   Null  Null   Null   1     1
40   2   Null   Null   Null  Null   0      0     0

Here the other extra columns in df (A, B and C) remain unaffected. Also the number of total rows in df1, df2, df3 and df4 does not equal to number of rows in df. How to achieve this in the fastest possible way and can it be done in just one statement or 4 different statements will have to be used here?

Any help deeply appreciated. Many thanks in advance.

Upvotes: 0

Views: 471

Answers (1)

jezrael
jezrael

Reputation: 862406

Use concat in list comprehension for create MultiIndex by TD and PD used for outer join by DataFrame.set_index and then DataFrame.reset_index for MultiIndex to columns:

dfs = [df1, df2, df3, df4]
dfnew = pd.concat([x.set_index(['TD', 'PD']) for x in dfs], axis=1).reset_index()
print (dfnew)
   TD  PD   CC1   CC2  CC3   CC4
0  10   1   NaN  15.0  NaN   NaN
1  10   2   0.0  10.0  NaN   NaN
2  10   3   NaN   NaN  0.0   NaN
3  20   1   5.0   NaN  NaN   NaN
4  20   4   2.0  20.0  5.0   0.0
5  30   1   NaN   NaN  9.0  15.0
6  30   3  10.0   NaN  NaN  20.0


df = df.combine_first(dfnew)

EDIT: Error means there are duplicates in combinations TD, PD.

#changed data for simulate error
print (df1)
   TD  PD  CC1
0  10   2    0
1  20   4    5 <- duplicates
2  20   4    2 <- duplicates
3  20   4   10 <- duplicates

One idea is remove duplicated rows, so in output is only first value:

dfs = [df1, df2, df3, df4]
dfnew = pd.concat([x.drop_duplicates(['TD','PD']).set_index(['TD', 'PD']) 
                   for x in dfs], axis=1).reset_index()
print (dfnew)
   TD  PD  CC1   CC2  CC3   CC4
0  10   1  NaN  15.0  NaN   NaN
1  10   2  0.0  10.0  NaN   NaN
2  10   3  NaN   NaN  0.0   NaN
3  20   4  5.0  20.0  5.0   0.0 <- first value 5
4  30   1  NaN   NaN  9.0  15.0
5  30   3  NaN   NaN  NaN  20.0

Another idea is aggregated, e.g. by sum:

dfnew = pd.concat([x.groupby(['TD', 'PD']).sum() for x in dfs], axis=1).reset_index()
print (dfnew)
   TD  PD   CC1   CC2  CC3   CC4
0  10   1   NaN  15.0  NaN   NaN
1  10   2   0.0  10.0  NaN   NaN
2  10   3   NaN   NaN  0.0   NaN
3  20   4  17.0  20.0  5.0   0.0 <- summed values - 17
4  30   1   NaN   NaN  9.0  15.0
5  30   3   NaN   NaN  NaN  20.0

Upvotes: 2

Related Questions