Reputation: 575
I have two DataFrames, like:
df1 = pd.DataFrame([["tom", 1, 2, 3], ["bob", 3, 4, 5], ["ali", 6, 7, 8]], columns=["name", "A", "B", "C"])
df1
Out[44]:
name A B C
0 tom 1 2 3
1 bob 3 4 5
2 ali 6 7 8
df2 = pd.DataFrame([["rob", 1, 2, 3], ["ali", 6, 7, 8]], columns=["name", "A", "B", "D"])
df2
Out[46]:
name A B D
0 rob 1 2 3
1 ali 6 7 8
how can I perform a sum operations to the values with same 'name' and same column, and get a result DataFrame like:
name A B C D
0 tom 1 2 3 NaN # <- tom and bob don't shows up in df2, so the sum is identical
1 bob 3 4 5 NaN # to their values in df1
2 rob 1 2 NaN 3 # <- rob only shows up on df2, so the sum equal to its df2 values
3 ali 12 14 8 8 # <- ali's A and B are sum up, and C and D are identical to their
# corresponding value in df1 and df2
please note, I don't know what names will show up in 'name' columns of both DataFrames.
And, because I have more than two such DataFrames to sum up, how can I do this with all of them in one operation, instead of sum up one by one, if that is possible? Many thanks.
Upvotes: 2
Views: 958
Reputation: 23217
(1) If your df1 and df2 does not contain duplicate names within one dataframe (still can have duplicates across 2 dataframes):
You can use .add()
with parameter fill_value=0
after we set_index()
to column name
on the 2 dataframes, as follows:
df3 = (df1.set_index('name')
.add(df2.set_index('name'), fill_value=0)
).reset_index()
NaN
values for non-matching entries are retained in this way.
By setting index to the same name
column, Pandas knows to align rows with the same row index (same name) before the add operation. Hence, we can get the correct sum with same name.
Result:
print(df3)
name A B C D
0 ali 12.0 14.0 8.0 8.0
1 bob 3.0 4.0 5.0 NaN
2 rob 1.0 2.0 NaN 3.0
3 tom 1.0 2.0 3.0 NaN
(2) If your df1 and df2 contain duplicate names within one dataframe:
If your df1 and df2 contain duplicate names in within one dataframe, you can use:
df3 = (pd.concat([df1, df2])
.groupby('name', as_index=False)
.sum()
)
Result:
print(df3)
name A B C D
0 ali 12 14 8.0 8.0
1 bob 3 4 5.0 0.0
2 rob 1 2 0.0 3.0
3 tom 12 14 16.0 0.0
In this way, the NaN
values of non-matching entries are padded with 0
.
Upvotes: 1
Reputation: 475
Hope this solves your problem. I have modified Nan as 0.
import pandas as pd
df1 = pd.DataFrame([["tom", 1, 2, 3], ["bob", 3, 4, 5], ["ali", 6, 7, 8]], columns=["name", "A", "B", "C"])
df2 = pd.DataFrame([["rob", 1, 2, 3], ["ali", 6, 7, 8]], columns=["name", "A", "B", "D"])
df3=pd.concat([df1, df2], ignore_index=True, sort=False)
df4=df3.groupby(['name'])['A','B','C','D'].sum()
print(df4)
Upvotes: 1