Reputation: 3293
I'm stuck between pandas `concat` and `merge`, struggling to get the best of both. I need to inner-join the rows on `name` and `date`, summing the common columns `A` and `B`, and keep the value from `category` (I could add it to the join keys if needed; they always match).
Example:
df1
| name | date | A | B | category |
|------|----------|---|---|----------|
| W | 1/1/2020 | 1 | 1 | home |
| W | 1/2/2020 | 1 | 1 | home |
| Y | 1/3/2020 | 1 | 1 | garden |
| Y | 1/4/2020 | 1 | 1 | garden |
df2
| name | date | A | B | category |
|------|----------|---|---|----------|
| W | 1/1/2020 | 2 | 2 | home |
| W | 1/2/2020 | 1 | 1 | home |
| Y | 1/3/2020 | 1 | 1 | garden |
| Y | 1/5/2020 | 1 | 1 | garden |
Desired result:
| name | date | A | B | category |
|------|----------|---|---|----------|
| W | 1/1/2020 | 3 | 3 | home |
| W | 1/2/2020 | 2 | 2 | home |
| Y | 1/3/2020 | 2 | 2 | garden |
I've found that `merge` will join the rows, but it duplicates any columns not joined on instead of summing them. `concat` will sum the rows, but it doesn't do an inner join, so I get rows with data from just one dataframe or the other. I've tried `pd.concat([df_1, df_2], join='inner')`, but the `'inner'` doesn't do what I want.
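For reference, this is roughly what I've been running (a sketch; `df_1` and `df_2` are the frames above):

```python
import pandas as pd

# merge inner-joins on the keys, but every column not joined on comes back
# duplicated (A_x/A_y, B_x/B_y, category_x/category_y) instead of summed
merged = pd.merge(df_1, df_2, on=['name', 'date'])

# concat just stacks the rows; join='inner' intersects the columns, not the keys
stacked = pd.concat([df_1, df_2], join='inner')
```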
Upvotes: 0
Views: 2262
Reputation: 450
How about this one:
```python
dff = (df1.set_index(['name', 'date', 'category'])
          .add(df2.set_index(['name', 'date', 'category']))  # NaN for keys missing from either frame
          .dropna()                                          # dropping the NaNs is the inner join
          .astype(int)                                       # .add introduced NaN, so A and B became floats
          .reset_index()
          .reindex(columns=df1.columns))
dff
```
Output:
```
  name      date  A  B category
0    W  1/1/2020  3  3     home
1    W  1/2/2020  2  2     home
2    Y  1/3/2020  2  2   garden
```
Is this what you want?
Upvotes: 0
Reputation: 3598
First "naive" solution:
d3 = pd.merge(d1,d2, left_on = ["name","date","category"], right_on = ["name","date","category"])
d3 = d3.assign(A = d3.A_x + d3.A_y, B = d3.B_x + d3.B_y, ).drop(columns=["A_y","A_x","B_y","B_x"])
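(With `merge`'s default suffixes, the overlapping columns come out of the join as `A_x`/`A_y` and `B_x`/`B_y`, which is why they have to be summed and dropped manually.)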
A better solution, with no need to add the columns manually:
```python
key = ["name", "date", "category"]
d3 = pd.merge(d1, d2, on=key)[key]               # just the keys present in both frames
d4 = pd.concat([d1, d2]).groupby(by=key).sum()   # sum A and B over the union of keys
d5 = pd.merge(d3, d4, on=key)                    # restrict the sums to the common keys
```
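Here `d4` is computed over the union of keys, so rows that exist in only one dataframe are still in it; merging it back onto `d3`, which holds only the keys present in both frames, trims the result to the intersection.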
Result:
```
  name      date category  A  B
0    W  1/1/2020     home  3  3
1    W  1/2/2020     home  2  2
2    Y  1/3/2020   garden  2  2
```
With `pd.merge` you can specify the composite key `["name", "date", "category"]` when joining the two dataframes.
Upvotes: 0
Reputation: 307
I would use a `groupby` on `name`, `date`, and `category`, with a `sum` aggregate (to sum `A` and `B`). This gives extra rows though, as the rows with the dates `1/4/2020` and `1/5/2020` don't just disappear: they are summed at 1.
This is the code:
```python
import pandas as pd

df = pd.DataFrame({'name': ['W', 'W', 'Y', 'Y'],
                   'date': ['1/1/2020', '1/2/2020', '1/3/2020', '1/4/2020'],
                   'A': [1, 1, 1, 1],
                   'B': [1, 1, 1, 1],
                   'category': ['home', 'home', 'garden', 'garden']})
df2 = pd.DataFrame({'name': ['W', 'W', 'Y', 'Y'],
                    'date': ['1/1/2020', '1/2/2020', '1/3/2020', '1/5/2020'],
                    'A': [2, 1, 1, 1],
                    'B': [2, 1, 1, 1],
                    'category': ['home', 'home', 'garden', 'garden']})

# stack the two frames, then sum A and B per (name, date, category)
df3 = pd.concat([df, df2]).groupby(by=['name', 'date', 'category']).sum()
```
This gives you:
```
                        A  B
name date     category
W    1/1/2020 home      3  3
     1/2/2020 home      2  2
Y    1/3/2020 garden    2  2
     1/4/2020 garden    1  1
     1/5/2020 garden    1  1
```
You can then filter on the values of `A` and/or `B` if you don't want to see the rows with a sum of 1.
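For example (a sketch that relies on the sample data, where `A` and `B` are at least 1 in every input row, so keys present in both frames always sum to 2 or more):

```python
# keep only the keys whose sums show they appeared in both frames
df3 = df3[(df3['A'] > 1) & (df3['B'] > 1)].reset_index()
```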
Hope this helps.
Upvotes: 1