sum rows from two different data frames based on the value of columns

Question

I have two data frames

df1

            ID  Year Primary_Location Secondary_Location  Sales
0           11  2023          NewYork            Chicago    100
1           11  2023             Lyon      Chicago,Paris    200
2           11  2023           Berlin              Paris    300
3           12  2022          Newyork            Chicago    150
4           12  2022             Lyon      Chicago,Paris    250
5           12  2022           Berlin              Paris    400

df2

            ID  Year Primary_Location  Sales
0           11  2023          Chicago    150
1           11  2023            Paris    200
2           12  2022          Chicago    300
3           12  2022            Paris    350

I would like for each group having the same ID & Year: to add the column Sales from df2 to Sales in df1 where Primary_Location in df2 appear (contained) in Secondary_Location in df1.

For example: For ID=11 & Year=2023, Sales for Lyon would be added to Sales for Chicago & Sales for Paris of df_2.

New Sales of Paris for that row would be 200+150+200=550.

The expected output would be :

df_primary_output



            ID  Year Primary_Location Secondary_Location  Sales
0           11  2023          NewYork            Chicago    250
1           11  2023             Lyon      Chicago,Paris    550
2           11  2023           Berlin              Paris    500
3           12  2022          Newyork            Chicago    400
4           12  2022             Lyon      Chicago,Paris    900
5           12  2022           Berlin              Paris    750

Here are the dataframes to start with :

import pandas as pd

df1 = pd.DataFrame({'ID': [11, 11, 11, 12, 12, 12],
                   'Year': [2023, 2023, 2023, 2022, 2022, 2022],
                   'Primary_Location': ['NewYork', 'Lyon', 'Berlin', 'Newyork', 'Lyon', 'Berlin'],
                   'Secondary_Location': ['Chicago', 'Chicago,Paris', 'Paris', 'Chicago', 'Chicago,Paris', 'Paris'],
                   'Sales': [100, 200, 300, 150, 250, 400]
                   })

df2 = pd.DataFrame({'ID': [11, 11, 12, 12],
                   'Year': [2023, 2023, 2022, 2022],
                   'Primary_Location': ['Chicago', 'Paris', 'Chicago', 'Paris'],
                   'Sales': [150, 200, 300, 350]
                   })

EDIT: pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects

Would be great if the solution could work for these inputs as well:

df1

       Day  ID  Year Primary_Location Secondary_Location  Sales
0       1   11  2023          NewYork            Chicago    100
1       1   11  2023           Berlin            Chicago    300
2       1   11  2022          Newyork            Chicago    150
3       1   11  2022           Berlin            Chicago    400

df2

     Day    ID  Year Primary_Location  Sales
0     1     11  2023          Chicago    150
1     1     11  2022          Chicago    300

The expected output would be :

df_primary_output



       Day  ID  Year Primary_Location Secondary_Location  Sales
0       1   11  2023          NewYork            Chicago    250
1       1   11  2023           Berlin            Chicago    450
2       1   11  2022          Newyork            Chicago    450
3       1   11  2022           Berlin            Chicago    700

rhug123 · Accepted Answer

This should work:

s = 'Secondary_Location'
(df1.assign(Secondary_Location = lambda x: x[s].str.split(','))
.explode(s)
.join(df2.set_index(['ID','Year','Primary_Location'])['Sales'].rename('Sales_2'),on = ['ID','Year',s])
.groupby(level=0)['Sales_2'].sum()
.add(df1['Sales']))

or

df3 = (df1.assign(Secondary_Location = df1['Secondary_Location'].str.split(',')) #split Secondary_Location column into list and explode it so each row has one value
.explode('Secondary_Location'))


(df3[['ID','Year','Secondary_Location']].apply(tuple,axis=1) #create a series where ID, Year and Secondary_Location are a combined into a tuple so we can map our series created below to bring in the values needed.
.map(df2.set_index(['ID','Year','Primary_Location'])['Sales']) #create a series with lookup values in index, and make a series by selecting Sales column
.groupby(level=0).sum() #when exploding the column above, the index was repeated, so groupby(level=0).sum() will combine back to original form.
.add(df1['Sales'])) #add in original sales column

Original Answer:

s = 'Secondary_Location'
(df.assign(Secondary_Location = lambda x: x[s].str.split(','))
.explode(s)
.join(df2.set_index(['ID','Year','Primary_Location'])['Sales'].rename('Sales_2'),on = ['ID','Year',s])
.groupby(level=0)
.agg({**dict.fromkeys(df,'first'),**{s:','.join,'Sales_2':'sum'}})
.assign(Sales = lambda x: x['Sales'] + x['Sales_2'])
.drop('Sales_2',axis=1))

Output:

   ID  Year Primary_Location Secondary_Location  Sales
0  11  2023          NewYork            Chicago    250
1  11  2023             Lyon      Chicago,Paris    550
2  11  2023           Berlin              Paris    500
3  12  2022          Newyork            Chicago    450
4  12  2022             Lyon      Chicago,Paris    900
5  12  2022           Berlin              Paris    750

sum rows from two different data frames based on the value of columns

Answers (2)

Related Questions