Reputation: 1223
I have the following two data frames:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'name': np.repeat(['Brandon', 'Erica'], [3, 4]),
                    'date': ['2019-01', '2019-02', '2019-03', '2018-01', '2018-02', '2018-03', '2018-04'],
                    'value': [1, 2, 3, 4, 5, 6, 7]})
df2 = pd.DataFrame({'name': np.repeat(['Brandon', 'Erica'], [2, 2]),
                    'date': ['2019-02', '2019-03', '2018-01', '2018-04']})
print(df1)
# name date value
#0 Brandon 2019-01 1
#1 Brandon 2019-02 2
#2 Brandon 2019-03 3
#3 Erica 2018-01 4
#4 Erica 2018-02 5
#5 Erica 2018-03 6
#6 Erica 2018-04 7
print(df2)
# name date
#0 Brandon 2019-02
#1 Brandon 2019-03
#2 Erica 2018-01
#3 Erica 2018-04
The goal is to add a new field to df2 that, for each name and date pair in df2, sums the value column over the two rows of df1 immediately preceding that pair within the same name. If df1 does not contain two prior rows with that name, the new field should be NaN.
Below is the code I wrote to accomplish this, but I am interested in something more efficient than the for loop. Thanks in advance.
value_sums = []
for i in range(len(df2)):
    # positional index of the matching (name, date) row in df1
    f = df1.index[(df1['name'] == df2.iloc[i]['name']) & (df1['date'] == df2.iloc[i]['date'])].tolist()[0]
    if df1.iloc[f]['name'] != df1.iloc[f - 2]['name']:
        # fewer than two prior rows for this name
        value_sums.append(np.nan)
    else:
        g = df1.iloc[f - 2:f]['value'].sum()
        value_sums.append(g)
df2['value_sums'] = value_sums
print(df2)
# name date value_sums
#0 Brandon 2019-02 NaN
#1 Brandon 2019-03 3.0
#2 Erica 2018-01 NaN
#3 Erica 2018-04 11.0
Upvotes: 0
Views: 38
Reputation: 4224
I think the following does what you're describing. It assumes the dates are sorted in ascending order within each name group, as in your example; you can enforce this with a sort if that's not the case. Because it avoids an explicit loop over the rows, it should be much faster at scale.
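For instance, a minimal sketch of that sort (with these zero-padded 'YYYY-MM' strings, lexicographic order matches chronological order):
# Optional: guarantee ascending date order within each name group
df1 = df1.sort_values(['name', 'date']).reset_index(drop=True)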
First, your initial setup:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'name': np.repeat(['Brandon', 'Erica'], [3, 4]),
                    'date': ['2019-01', '2019-02', '2019-03', '2018-01', '2018-02', '2018-03', '2018-04'],
                    'value': [1, 2, 3, 4, 5, 6, 7]})
df2 = pd.DataFrame({'name': np.repeat(['Brandon', 'Erica'], [2, 2]),
                    'date': ['2019-02', '2019-03', '2018-01', '2018-04']})
print(df1)
# name date value
#0 Brandon 2019-01 1
#1 Brandon 2019-02 2
#2 Brandon 2019-03 3
#3 Erica 2018-01 4
#4 Erica 2018-02 5
#5 Erica 2018-03 6
#6 Erica 2018-04 7
print(df2)
# name date
#0 Brandon 2019-02
#1 Brandon 2019-03
#2 Erica 2018-01
#3 Erica 2018-04
Next, let's implement your logic:
df1['value_sum'] = df1.groupby(['name']).shift(1).rolling(2).value.sum()
df1
name date value value_sum
0 Brandon 2019-01 1 NaN
1 Brandon 2019-02 2 NaN
2 Brandon 2019-03 3 3.0
3 Erica 2018-01 4 NaN
4 Erica 2018-02 5 NaN
5 Erica 2018-03 6 9.0
6 Erica 2018-04 7 11.0
We group by name, shift by one, and then take a rolling sum with a window of two. The shift is what makes the two-row window cover the previous two rows rather than including the current one, and the NaN the shift places at the start of each name group keeps windows from summing across names.
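If you prefer to keep the window computation explicitly inside each group rather than relying on that boundary NaN, an equivalent sketch (same result on this data) is:
# Equivalent per-group version: shift and roll within each name group
df1['value_sum'] = (
    df1.groupby('name')['value']
       .transform(lambda s: s.shift(1).rolling(2).sum())
)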
Finally, we merge the dataframes:
df2.merge(df1[['name', 'date', 'value_sum']], how='left', on=['name', 'date'])
name date value_sum
0 Brandon 2019-02 NaN
1 Brandon 2019-03 3.0
2 Erica 2018-01 NaN
3 Erica 2018-04 11.0
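Note that the merge above only displays the result; to store it on df2 (and match the column name used in the question), something like the following would work:
# Keep the merged result on df2, renaming to the question's column name
df2 = df2.merge(df1[['name', 'date', 'value_sum']], how='left', on=['name', 'date'])
df2 = df2.rename(columns={'value_sum': 'value_sums'})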
Upvotes: 1