Reputation: 161
I have two tables:
df1 = pd.DataFrame({
"c_id": [2000,3000,3000],
"cloud":["GCP","GCP","Azure"],
"invoice":[100,100,300]
})
c_id cloud invoice
2000 GCP 100
3000 GCP 100
3000 Azure 300
df2 = pd.DataFrame({
"c_id": [1000,2000,2000,3000,3000],
"cloud":["Azure","GCP","Azure","AWS","Azure"],
"invoice":[200,200,300,100,100]
})
c_id cloud invoice
1000 Azure 200
2000 GCP 200
2000 Azure 300
3000 AWS 100
3000 Azure 100
I want to add the two tables based on the combination of columns c_id
and cloud
. The result I'm looking for is:
c_id cloud invoice
1000 Azure 200
2000 Azure 300
2000 GCP 300
3000 AWS 100
3000 Azure 400
3000 GCP 100
In my example, I only displayed the column invoice
. In my actual dataset, there are actually more than 40 columns with a lot more constraints. Several of the columns only have value when the cloud
is Azure
, while others only have value when the cloud
is either Azure
or GCP
Is there's a clean way to add df1
and df2
?
Upvotes: 1
Views: 57
Reputation: 34086
You can also use df.set_index
:
df1.set_index(['c_id', 'cloud']).add(df2.set_index(['c_id', 'cloud']), fill_value=0).reset_index()
Output:
c_id cloud invoice
0 1000 Azure 200.00
1 2000 Azure 300.00
2 2000 GCP 300.00
3 3000 AWS 100.00
4 3000 Azure 400.00
5 3000 GCP 100.00
Upvotes: 2
Reputation: 863166
Use concat
with aggregate sum
:
df1 = pd.DataFrame({
"c_id": [2000,3000,3000],
"cloud":["GCP","GCP","Azure"],
"invoice":[100,100,300]
})
print (df1)
c_id cloud invoice
0 2000 GCP 100
1 3000 GCP 100
2 3000 Azure 300
df2 = pd.DataFrame({
"c_id": [1000,2000,2000,3000,3000],
"cloud":["Azure","GCP","Azure","AWS","Azure"],
"invoice":[200,200,300,100,100]
})
print (df2)
c_id cloud invoice
0 1000 Azure 200
1 2000 GCP 200
2 2000 Azure 300
3 3000 AWS 100
4 3000 Azure 100
df = pd.concat([df1, df2]).groupby(['c_id','cloud'], as_index=False).sum()
print (df)
c_id cloud invoice
0 1000 Azure 200
1 2000 Azure 300
2 2000 GCP 300
3 3000 AWS 100
4 3000 Azure 400
5 3000 GCP 100
Upvotes: 1