Reputation: 739
I have two data frames
df1
ID Year Primary_Location Secondary_Location Sales
0 11 2023 NewYork Chicago 100
1 11 2023 Lyon Chicago,Paris 200
2 11 2023 Berlin Paris 300
3 12 2022 Newyork Chicago 150
4 12 2022 Lyon Chicago,Paris 250
5 12 2022 Berlin Paris 400
df2
ID Year Primary_Location Sales
0 11 2023 Chicago 150
1 11 2023 Paris 200
2 12 2022 Chicago 300
3 12 2022 Paris 350
I would like for each group having the same ID
& Year
:
to add the column Sales
from df2
to Sales
in df1
where Primary_Location
in df2
appear (contained) in Secondary_Location
in df1
.
For example: For ID=11
& Year=2023
, Sales
for Lyon
would be added to Sales
for Chicago
& Sales
for Paris
of df_2
.
New Sales
of Paris
for that row would be 200+150+200=550.
The expected output would be :
df_primary_output
ID Year Primary_Location Secondary_Location Sales
0 11 2023 NewYork Chicago 250
1 11 2023 Lyon Chicago,Paris 550
2 11 2023 Berlin Paris 500
3 12 2022 Newyork Chicago 400
4 12 2022 Lyon Chicago,Paris 900
5 12 2022 Berlin Paris 750
Here are the dataframes to start with :
import pandas as pd
df1 = pd.DataFrame({'ID': [11, 11, 11, 12, 12, 12],
'Year': [2023, 2023, 2023, 2022, 2022, 2022],
'Primary_Location': ['NewYork', 'Lyon', 'Berlin', 'Newyork', 'Lyon', 'Berlin'],
'Secondary_Location': ['Chicago', 'Chicago,Paris', 'Paris', 'Chicago', 'Chicago,Paris', 'Paris'],
'Sales': [100, 200, 300, 150, 250, 400]
})
df2 = pd.DataFrame({'ID': [11, 11, 12, 12],
'Year': [2023, 2023, 2022, 2022],
'Primary_Location': ['Chicago', 'Paris', 'Chicago', 'Paris'],
'Sales': [150, 200, 300, 350]
})
EDIT: pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
Would be great if the solution could work for these inputs as well:
df1
Day ID Year Primary_Location Secondary_Location Sales
0 1 11 2023 NewYork Chicago 100
1 1 11 2023 Berlin Chicago 300
2 1 11 2022 Newyork Chicago 150
3 1 11 2022 Berlin Chicago 400
df2
Day ID Year Primary_Location Sales
0 1 11 2023 Chicago 150
1 1 11 2022 Chicago 300
The expected output would be :
df_primary_output
Day ID Year Primary_Location Secondary_Location Sales
0 1 11 2023 NewYork Chicago 250
1 1 11 2023 Berlin Chicago 450
2 1 11 2022 Newyork Chicago 450
3 1 11 2022 Berlin Chicago 700
Upvotes: 1
Views: 235
Reputation: 2263
Not so easy your question...
Proposed script
import pandas as pd
df1 = pd.DataFrame({'ID': [11, 11, 11, 12, 12, 12],
'Year': [2023, 2023, 2023, 2022, 2022, 2022],
'Primary_Location': ['NewYork', 'Lyon', 'Berlin', 'Newyork', 'Lyon', 'Berlin'],
'Secondary_Location': ['Chicago', 'Chicago,Paris', 'Paris', 'Chicago', 'Chicago,Paris', 'Paris'],
'Sales': [100, 200, 300, 150, 250, 400]
})
df2 = pd.DataFrame({'ID': [11, 11, 12, 12],
'Year': [2023, 2023, 2022, 2022],
'Primary_Location': ['Chicago', 'Paris', 'Chicago', 'Paris'],
'Sales': [150, 200, 300, 350]
})
tot = []
def func(g, iterdf, len_df1, i = 0):
global tot
kv = {g['Primary_Location'].iloc[i]:g['Sales'].iloc[i] for i in range(len(g))}
while i < len_df1:
row = next(iterdf)[1]
# Select specific df1 rows to modify by ID and Year criteria
if g['ID'][g.index[0]]==row['ID'] and g['Year'][g.index[0]]==row['Year']:
tot.append(row['Sales'] + sum([kv[town] for town in row['Secondary_Location'].split(',') if town in kv]))
i+=1
df2.groupby(['ID', 'Year'], sort=False).apply(lambda g: func(g, df1.iterrows(), len(df1)))
df1['Sales'] = tot
print(df1)
Result :
ID Year Primary_Location Secondary_Location Sales
0 11 2023 NewYork Chicago 250
1 11 2023 Lyon Chicago,Paris 550
2 11 2023 Berlin Paris 500
3 12 2022 Newyork Chicago 450
4 12 2022 Lyon Chicago,Paris 900
5 12 2022 Berlin Paris 750
Are you sure of the result in line 3, my script found 450 and not 400 ?
Explanation :
1 - group(...).apply(...)
sends two groups from df2
one by one to func()
:
ID Year Primary_Location Sales
0 11 2023 Chicago 150
1 11 2023 Paris 200
ID Year Primary_Location Sales
2 12 2022 Chicago 300
3 12 2022 Paris 350
2 - kv
returns dictionnaries from df2
like this :
(each iteration corresponds to a group ie ID + Year)
call 1 - {'Chicago': 100, 'Paris': 200}
call 2 - {'Chicago': 300, 'Paris': 350}
3 - Function while
followed by used of next(iterator)
allows to explore rows in g
(group) one by one :
while i < len_df1:
row = next(iterdf)[1]
...
i+=1
4 - The if
condition in while
loop allows to filter df1
rows in order that ID
and Year
corresponds to df2
's ones.
And for each correspondance to append to the global list tot
df1
and df2
sales values
5 - tot
is a global list for memorizing values and is passed to df1
for the Sales
column creation :
df1['Sales'] = tot
Result with the new dataframes sample :
ID Year Primary_Location Secondary_Location Sales
0 11 2023 NewYork Chicago 250
1 11 2023 Berlin Chicago 450
2 11 2022 Newyork Chicago 450
3 11 2022 Berlin Chicago 700
Upvotes: 1
Reputation: 8768
This should work:
s = 'Secondary_Location'
(df1.assign(Secondary_Location = lambda x: x[s].str.split(','))
.explode(s)
.join(df2.set_index(['ID','Year','Primary_Location'])['Sales'].rename('Sales_2'),on = ['ID','Year',s])
.groupby(level=0)['Sales_2'].sum()
.add(df1['Sales']))
or
df3 = (df1.assign(Secondary_Location = df1['Secondary_Location'].str.split(',')) #split Secondary_Location column into list and explode it so each row has one value
.explode('Secondary_Location'))
(df3[['ID','Year','Secondary_Location']].apply(tuple,axis=1) #create a series where ID, Year and Secondary_Location are a combined into a tuple so we can map our series created below to bring in the values needed.
.map(df2.set_index(['ID','Year','Primary_Location'])['Sales']) #create a series with lookup values in index, and make a series by selecting Sales column
.groupby(level=0).sum() #when exploding the column above, the index was repeated, so groupby(level=0).sum() will combine back to original form.
.add(df1['Sales'])) #add in original sales column
Original Answer:
s = 'Secondary_Location'
(df.assign(Secondary_Location = lambda x: x[s].str.split(','))
.explode(s)
.join(df2.set_index(['ID','Year','Primary_Location'])['Sales'].rename('Sales_2'),on = ['ID','Year',s])
.groupby(level=0)
.agg({**dict.fromkeys(df,'first'),**{s:','.join,'Sales_2':'sum'}})
.assign(Sales = lambda x: x['Sales'] + x['Sales_2'])
.drop('Sales_2',axis=1))
Output:
ID Year Primary_Location Secondary_Location Sales
0 11 2023 NewYork Chicago 250
1 11 2023 Lyon Chicago,Paris 550
2 11 2023 Berlin Paris 500
3 12 2022 Newyork Chicago 450
4 12 2022 Lyon Chicago,Paris 900
5 12 2022 Berlin Paris 750
Upvotes: 2