sagungrp
sagungrp

Reputation: 161

Add the values of two tables based on a combination of columns

I have two tables:

df1 = pd.DataFrame({
    "c_id": [2000,3000,3000], 
    "cloud":["GCP","GCP","Azure"], 
    "invoice":[100,100,300]
})

c_id    cloud   invoice
2000    GCP     100
3000    GCP     100
3000    Azure   300

df2 = pd.DataFrame({
    "c_id": [1000,2000,2000,3000,3000], 
    "cloud":["Azure","GCP","Azure","AWS","Azure"], 
    "invoice":[200,200,300,100,100]
})

c_id    cloud   invoice
1000    Azure   200
2000    GCP     200
2000    Azure   300
3000    AWS     100
3000    Azure   100

I want to add the two tables based on the combination of columns c_id and cloud. The result I'm looking for is:

c_id    cloud   invoice
1000    Azure   200
2000    Azure   300
2000    GCP     300
3000    AWS     100
3000    Azure   400
3000    GCP     100

In my example, I only displayed the column invoice. In my actual dataset, there are actually more than 40 columns with a lot more constraints. Several of the columns only have value when the cloud is Azure, while others only have value when the cloud is either Azure or GCP

Is there's a clean way to add df1 and df2?

Upvotes: 1

Views: 57

Answers (2)

Mayank Porwal
Mayank Porwal

Reputation: 34086

You can also use df.set_index:

df1.set_index(['c_id', 'cloud']).add(df2.set_index(['c_id', 'cloud']), fill_value=0).reset_index()

Output:

    c_id  cloud  invoice
0  1000  Azure   200.00
1  2000  Azure   300.00
2  2000    GCP   300.00
3  3000    AWS   100.00
4  3000  Azure   400.00
5  3000    GCP   100.00

Upvotes: 2

jezrael
jezrael

Reputation: 863166

Use concat with aggregate sum:

df1 = pd.DataFrame({
    "c_id": [2000,3000,3000], 
    "cloud":["GCP","GCP","Azure"], 
    "invoice":[100,100,300]
})
print (df1)
   c_id  cloud  invoice
0  2000    GCP      100
1  3000    GCP      100
2  3000  Azure      300


df2 = pd.DataFrame({
    "c_id": [1000,2000,2000,3000,3000], 
    "cloud":["Azure","GCP","Azure","AWS","Azure"], 
    "invoice":[200,200,300,100,100]
})
print (df2)
   c_id  cloud  invoice
0  1000  Azure      200
1  2000    GCP      200
2  2000  Azure      300
3  3000    AWS      100
4  3000  Azure      100

df = pd.concat([df1, df2]).groupby(['c_id','cloud'], as_index=False).sum()
print (df)
   c_id  cloud  invoice
0  1000  Azure      200
1  2000  Azure      300
2  2000    GCP      300
3  3000    AWS      100
4  3000  Azure      400
5  3000    GCP      100

Upvotes: 1

Related Questions