Reputation: 544
How to merge 4 pandas dataframes into one original dataframe in the most efficient way?
Below shows the original dataframe df
whose 4 columns CC1
, CC2
, CC3
and CC4
need to be updated with the respective columns from df1
, df2
, df3
and df4
.
The common columns for all dataframes are TD
and PD
.
EDIT
df
TD PD CC1 CC2 CC3 CC4 A B C
10 1 Null Null Null Null 1 1 0
10 2 Null Null Null Null 0 1 1
10 3 Null Null Null Null Null 2 Null
20 1 Null Null Null Null Null 0 1
20 4 Null Null Null Null 2 Null Null
30 1 Null Null Null Null 4 0 2
30 3 Null Null Null Null 8 Null 5
30 5 Null Null Null Null Null 1 1
40 2 Null Null Null Null 0 0 0
df1
TD PD CC1
10 2 0
20 1 5
20 4 2
30 3 10
df2
TD PD CC2
10 1 15
10 2 10
20 4 20
df3
TD PD CC3
10 3 0
20 4 5
30 1 9
df4
TD PD CC4
20 4 0
30 1 15
30 3 20
Expected output after merge is as shown below:
df
TD PD CC1 CC2 CC3 CC4 A B C
10 1 Null 15 Null Null 1 1 0
10 2 0 10 Null Null 0 1 1
10 3 Null Null 0 Null Null 2 Null
20 1 5 Null Null Null Null 0 1
20 4 2 20 5 0 2 Null Null
30 1 Null Null 9 15 4 0 2
30 3 10 Null Null 20 8 Null 5
30 5 Null Null Null Null Null 1 1
40 2 Null Null Null Null 0 0 0
Here the other extra columns in df
(A
, B
and C
) remain unaffected.
Also the number of total rows in df1, df2, df3 and df4 does not equal to number of rows in df. How to achieve this in the fastest possible way and can it be done in just one statement or 4 different statements will have to be used here?
Any help deeply appreciated. Many thanks in advance.
Upvotes: 0
Views: 471
Reputation: 862406
Use concat
in list comprehension for create MultiIndex
by TD
and PD
used for outer join by DataFrame.set_index
and then DataFrame.reset_index
for MultiIndex
to columns:
dfs = [df1, df2, df3, df4]
dfnew = pd.concat([x.set_index(['TD', 'PD']) for x in dfs], axis=1).reset_index()
print (dfnew)
TD PD CC1 CC2 CC3 CC4
0 10 1 NaN 15.0 NaN NaN
1 10 2 0.0 10.0 NaN NaN
2 10 3 NaN NaN 0.0 NaN
3 20 1 5.0 NaN NaN NaN
4 20 4 2.0 20.0 5.0 0.0
5 30 1 NaN NaN 9.0 15.0
6 30 3 10.0 NaN NaN 20.0
df = df.combine_first(dfnew)
EDIT: Error means there are duplicates in combinations TD
, PD
.
#changed data for simulate error
print (df1)
TD PD CC1
0 10 2 0
1 20 4 5 <- duplicates
2 20 4 2 <- duplicates
3 20 4 10 <- duplicates
One idea is remove duplicated rows, so in output is only first value:
dfs = [df1, df2, df3, df4]
dfnew = pd.concat([x.drop_duplicates(['TD','PD']).set_index(['TD', 'PD'])
for x in dfs], axis=1).reset_index()
print (dfnew)
TD PD CC1 CC2 CC3 CC4
0 10 1 NaN 15.0 NaN NaN
1 10 2 0.0 10.0 NaN NaN
2 10 3 NaN NaN 0.0 NaN
3 20 4 5.0 20.0 5.0 0.0 <- first value 5
4 30 1 NaN NaN 9.0 15.0
5 30 3 NaN NaN NaN 20.0
Another idea is aggregated, e.g. by sum
:
dfnew = pd.concat([x.groupby(['TD', 'PD']).sum() for x in dfs], axis=1).reset_index()
print (dfnew)
TD PD CC1 CC2 CC3 CC4
0 10 1 NaN 15.0 NaN NaN
1 10 2 0.0 10.0 NaN NaN
2 10 3 NaN NaN 0.0 NaN
3 20 4 17.0 20.0 5.0 0.0 <- summed values - 17
4 30 1 NaN NaN 9.0 15.0
5 30 3 NaN NaN NaN 20.0
Upvotes: 2