Kyle H

Reputation: 3293

Pandas - merge two dataframes, sum similar columns, only keep rows with matching keys (inner join)

I'm stuck between pandas' concat and merge functions, struggling to get the best of both. I need to inner-join the rows on name and date, sum the common columns A and B, and keep the value from category (I could add category to the join keys if needed; the values match).

Example:

df1

| name | date     | A | B | category |
|------|----------|---|---|----------|
| W    | 1/1/2020 | 1 | 1 | home     |
| W    | 1/2/2020 | 1 | 1 | home     |
| Y    | 1/3/2020 | 1 | 1 | garden   |
| Y    | 1/4/2020 | 1 | 1 | garden   |

df2

| name | date     | A | B | category |
|------|----------|---|---|----------|
| W    | 1/1/2020 | 2 | 2 | home     |
| W    | 1/2/2020 | 1 | 1 | home     |
| Y    | 1/3/2020 | 1 | 1 | garden   |
| Y    | 1/5/2020 | 1 | 1 | garden   |

Desired result:

| name | date     | A | B | category |
|------|----------|---|---|----------|
| W    | 1/1/2020 | 3 | 3 | home     |
| W    | 1/2/2020 | 2 | 2 | home     |
| Y    | 1/3/2020 | 2 | 2 | garden   |

I've found that merge will join the rows, but it duplicates any columns not joined on instead of summing them.

concat will sum the rows, but it doesn't do an inner join, so I get rows with data from just one dataframe or the other. I've tried pd.concat([df1, df2], join='inner'), but the "inner" doesn't do what I want.
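For example, with df1 and df2 built from the tables above, all eight rows are still stacked, because join='inner' only intersects the columns (the non-concatenation axis), not the rows:

import pandas as pd

# join='inner' keeps only the columns common to both frames;
# every row from both frames is still present after stacking
stacked = pd.concat([df1, df2], join='inner')
len(stacked)  # 8 rows, not just the 3 with matching keys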

Upvotes: 0

Views: 2262

Answers (3)

M-M

Reputation: 450

How about this one:

dff = (df1
       .set_index(['name', 'date', 'category'])
       # align on the composite key; rows missing from either frame become NaN
       .add(df2.set_index(['name', 'date', 'category']))
       .reset_index()
       .dropna()                        # dropping the NaN rows gives the inner join
       .reindex(columns=df1.columns))   # restore the original column order

dff

Output:

  name      date    A    B category
0    W  1/1/2020  3.0  3.0     home
1    W  1/2/2020  2.0  2.0     home
2    Y  1/3/2020  2.0  2.0   garden
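One caveat: the rows that exist in only one frame become NaN before the dropna, which turns A and B into floats. If you want integers back, cast after the fact:

dff = dff.astype({'A': int, 'B': int})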

Is this what you want?

Upvotes: 0

ipj

Reputation: 3598

First "naive" solution:

d3 = pd.merge(df1, df2, left_on=["name", "date", "category"], right_on=["name", "date", "category"])
# merge suffixes the unjoined columns A and B as A_x/A_y and B_x/B_y; sum them back
d3 = d3.assign(A=d3.A_x + d3.A_y, B=d3.B_x + d3.B_y).drop(columns=["A_x", "A_y", "B_x", "B_y"])

A better solution, with no need to add the columns manually:

key = ["name", "date", "category"]
d3 = pd.merge(df1, df2, left_on=key, right_on=key)[key]  # keys present in both frames
d4 = pd.concat([df1, df2]).groupby(by=key).sum()         # sum A and B across both frames
d5 = pd.merge(d3, d4, left_on=key, right_on=key)         # inner join back onto the sums

Result:

  name      date category  A  B
0    W  1/1/2020     home  3  3
1    W  1/2/2020     home  2  2
2    Y  1/3/2020   garden  2  2

Using pd.merge you can specify the composite key ["name", "date", "category"] to join the two dataframes.
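Since the key columns carry the same names in both frames, on=key is an equivalent shorthand for the left_on/right_on pair:

d3 = pd.merge(df1, df2, on=key)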

Upvotes: 0

Xhattam

Reputation: 307

I would use a groupby on name, date, and category, with a sum aggregate (to sum A and B). This gives extra rows, though: the rows with the dates 1/4/2020 and 1/5/2020 don't just disappear, they simply sum to 1.

This is the code:

import pandas as pd

df = pd.DataFrame({'name': ['W', 'W', 'Y', 'Y'],
                   'date': ['1/1/2020', '1/2/2020', '1/3/2020', '1/4/2020'],
                   'A': [1, 1, 1, 1],
                   'B': [1, 1, 1, 1],
                   'category': ['home', 'home', 'garden', 'garden']})

df2 = pd.DataFrame({'name': ['W', 'W', 'Y', 'Y'],
                    'date': ['1/1/2020', '1/2/2020', '1/3/2020', '1/5/2020'],
                    'A': [2, 1, 1, 1],
                    'B': [2, 1, 1, 1],
                    'category': ['home', 'home', 'garden', 'garden']})

df3 = pd.concat([df, df2]).groupby(by=['name', 'date', 'category']).sum()

This gives you:

                         A  B
name date      category      
W    1/1/2020  home      3  3
     1/2/2020  home      2  2
Y    1/3/2020  garden    2  2
     1/4/2020  garden    1  1
     1/5/2020  garden    1  1

You can then filter on the values of A and/or B if you don't want to see the rows whose sum is 1.
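For example (a minimal sketch; with this toy data every original A and B is 1, so a sum of 1 can only come from a single frame):

# keep only keys that got contributions from both frames
df3 = df3[(df3['A'] > 1) & (df3['B'] > 1)]

# alternatively, count how many frames contributed each key
counts = pd.concat([df, df2]).groupby(['name', 'date', 'category']).size()
df3 = df3[counts == 2]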

Hope this helps.

Upvotes: 1
