Cuteufo
Cuteufo

Reputation: 575

Python pandas: summing value of two or more DataFrames with identical value in multiple columns

I have two DataFrames, like:

df1 = pd.DataFrame([["tom", 1, 2, 3], ["bob", 3, 4, 5], ["ali", 6, 7, 8]], columns=["name", "A", "B", "C"])
df1
Out[44]: 
  name  A  B  C
0  tom  1  2  3
1  bob  3  4  5
2  ali  6  7  8
df2 = pd.DataFrame([["rob", 1, 2, 3], ["ali", 6, 7, 8]], columns=["name", "A", "B", "D"])
df2
Out[46]: 
  name  A  B  D
0  rob  1  2  3
1  ali  6  7  8

how can I perform a sum operations to the values with same 'name' and same column, and get a result DataFrame like:

  name A   B   C   D
0  tom 1   2   3 NaN     # <- tom and bob don't shows up in df2, so the sum is identical
1  bob 3   4   5 NaN     #    to their values in df1
2  rob 1   2 NaN   3     # <- rob only shows up on df2, so the sum equal to its df2 values
3  ali 12 14   8   8     # <- ali's A and B are sum up, and C and D are identical to their
                         #    corresponding value in df1 and df2

please note, I don't know what names will show up in 'name' columns of both DataFrames.

And, because I have more than two such DataFrames to sum up, how can I do this with all of them in one operation, instead of sum up one by one, if that is possible? Many thanks.

Upvotes: 2

Views: 958

Answers (2)

SeaBean
SeaBean

Reputation: 23217

(1) If your df1 and df2 does not contain duplicate names within one dataframe (still can have duplicates across 2 dataframes):

You can use .add() with parameter fill_value=0 after we set_index() to column name on the 2 dataframes, as follows:

df3 = (df1.set_index('name')
          .add(df2.set_index('name'), fill_value=0)
      ).reset_index()

NaN values for non-matching entries are retained in this way.

By setting index to the same name column, Pandas knows to align rows with the same row index (same name) before the add operation. Hence, we can get the correct sum with same name.

Result:

print(df3)

  name     A     B    C    D
0  ali  12.0  14.0  8.0  8.0
1  bob   3.0   4.0  5.0  NaN
2  rob   1.0   2.0  NaN  3.0
3  tom   1.0   2.0  3.0  NaN

(2) If your df1 and df2 contain duplicate names within one dataframe:

If your df1 and df2 contain duplicate names in within one dataframe, you can use:

df3 = (pd.concat([df1, df2])
         .groupby('name', as_index=False)
         .sum()
      ) 

Result:

print(df3)

  name   A   B     C    D
0  ali  12  14   8.0  8.0
1  bob   3   4   5.0  0.0
2  rob   1   2   0.0  3.0
3  tom  12  14  16.0  0.0

In this way, the NaN values of non-matching entries are padded with 0.

Upvotes: 1

Vivs
Vivs

Reputation: 475

Hope this solves your problem. I have modified Nan as 0.

import pandas as pd
df1 = pd.DataFrame([["tom", 1, 2, 3], ["bob", 3, 4, 5], ["ali", 6, 7, 8]], columns=["name", "A", "B", "C"])
df2 = pd.DataFrame([["rob", 1, 2, 3], ["ali", 6, 7, 8]], columns=["name", "A", "B", "D"])
df3=pd.concat([df1, df2], ignore_index=True, sort=False)
df4=df3.groupby(['name'])['A','B','C','D'].sum()
print(df4)

Upvotes: 1

Related Questions