bshelt141

Reputation: 1223

Efficient way to reference one table from another

I have the following two data frames:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'name': np.repeat(['Brandon', 'Erica'], [3, 4]),
                    'date': ['2019-01', '2019-02', '2019-03', '2018-01', '2018-02', '2018-03', '2018-04'],
                    'value': [1,2,3,4,5,6,7]})

df2 = pd.DataFrame({'name': np.repeat(['Brandon', 'Erica'], [2, 2]),
                    'date': ['2019-02', '2019-03', '2018-01', '2018-04']})

print(df1)
#      name     date  value
#0  Brandon  2019-01      1
#1  Brandon  2019-02      2
#2  Brandon  2019-03      3
#3    Erica  2018-01      4
#4    Erica  2018-02      5
#5    Erica  2018-03      6
#6    Erica  2018-04      7


print(df2)
#      name     date
#0  Brandon  2019-02
#1  Brandon  2019-03
#2    Erica  2018-01
#3    Erica  2018-04

The goal is to add a new field to df2 that, for each name/date pair, sums the values from the two rows of df1 immediately preceding that pair. If df1 does not contain two prior rows with the same name before the name/date combination from df2, the new field should be NaN.

Below is the code I wrote to accomplish this, but I am interested in something more efficient than an explicit for loop. Thanks in advance.

value_sums = []
for i in range(len(df2)):
    # Locate the row in df1 matching this name/date pair.
    f = df1.index[(df1['name'] == df2.iloc[i]['name']) &
                  (df1['date'] == df2.iloc[i]['date'])].tolist()[0]
    # If there aren't two earlier rows for the same name, record NaN.
    # (The f < 2 guard also prevents f - 2 from wrapping around to the
    # end of the frame via negative indexing.)
    if f < 2 or df1.iloc[f]['name'] != df1.iloc[f - 2]['name']:
        value_sums.append(np.nan)
    else:
        g = df1.iloc[(f - 2):f]['value'].sum()
        value_sums.append(g)

df2['value_sums'] = value_sums

print(df2)
#      name     date  value_sums
#0  Brandon  2019-02         NaN
#1  Brandon  2019-03         3.0
#2    Erica  2018-01         NaN
#3    Erica  2018-04        11.0

Upvotes: 0

Views: 38

Answers (1)

Nick Becker

Reputation: 4224

I think the following does what you're describing. It assumes the dates are sorted in ascending order within each name group, as in your example; if that's not the case, you can enforce it with a sort. Because it avoids an explicit Python loop over the rows, it should be much faster at scale.
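If the dates might not already be sorted within each name, a quick sort up front would enforce that assumption (a sketch with made-up unsorted data, not part of the original question):

```python
import pandas as pd

# Hypothetical unsorted input, for illustration only.
df1 = pd.DataFrame({'name': ['Erica', 'Brandon', 'Erica'],
                    'date': ['2018-02', '2019-01', '2018-01'],
                    'value': [5, 1, 4]})

# Sort so dates ascend within each name group before applying
# the shift/rolling logic below.
df1 = df1.sort_values(['name', 'date']).reset_index(drop=True)
print(df1)
```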

First, your initial setup:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'name': np.repeat(['Brandon', 'Erica'], [3, 4]),
                    'date': ['2019-01', '2019-02', '2019-03', '2018-01', '2018-02', '2018-03', '2018-04'],
                    'value': [1,2,3,4,5,6,7]})

df2 = pd.DataFrame({'name': np.repeat(['Brandon', 'Erica'], [2, 2]),
                    'date': ['2019-02', '2019-03', '2018-01', '2018-04']})

print(df1)
#      name     date  value
#0  Brandon  2019-01      1
#1  Brandon  2019-02      2
#2  Brandon  2019-03      3
#3    Erica  2018-01      4
#4    Erica  2018-02      5
#5    Erica  2018-03      6
#6    Erica  2018-04      7


print(df2)
#      name     date
#0  Brandon  2019-02
#1  Brandon  2019-03
#2    Erica  2018-01
#3    Erica  2018-04

Next, let's implement your logic:

df1['value_sum'] = df1.groupby(['name']).shift(1).rolling(2).value.sum()
print(df1)
#      name     date  value  value_sum
#0  Brandon  2019-01      1        NaN
#1  Brandon  2019-02      2        NaN
#2  Brandon  2019-03      3        3.0
#3    Erica  2018-01      4        NaN
#4    Erica  2018-02      5        NaN
#5    Erica  2018-03      6        9.0
#6    Erica  2018-04      7       11.0

We grouped by name, shifted by one, and then took a rolling sum with window size two. The shift and the window of size two make sure we're summing the two rows immediately before each row. Note that the grouped shift puts a NaN at each name's first row, and the rolling sum's default min_periods then produces NaN for any window that would cross a name boundary, so sums never leak between names.
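The same shift-then-rolling logic can also be written as a per-group transform, which keeps the window computation explicitly inside each name group. This is a sketch of an equivalent formulation, not part of the original answer:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'name': np.repeat(['Brandon', 'Erica'], [3, 4]),
                    'date': ['2019-01', '2019-02', '2019-03',
                             '2018-01', '2018-02', '2018-03', '2018-04'],
                    'value': [1, 2, 3, 4, 5, 6, 7]})

# Shift and rolling sum applied inside each name group, so the
# window never spans two different names.
df1['value_sum'] = (df1.groupby('name')['value']
                       .transform(lambda s: s.shift(1).rolling(2).sum()))
print(df1)
```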

Finally, we merge the dataframes:

df2.merge(df1[['name', 'date', 'value_sum']], how='left', on=['name', 'date'])
#      name     date  value_sum
#0  Brandon  2019-02        NaN
#1  Brandon  2019-03        3.0
#2    Erica  2018-01        NaN
#3    Erica  2018-04       11.0
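One usage note (a sketch, not from the original answer): merge returns a new DataFrame rather than modifying df2 in place, so assign the result back if you want to keep the new column on df2:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'name': np.repeat(['Brandon', 'Erica'], [3, 4]),
                    'date': ['2019-01', '2019-02', '2019-03',
                             '2018-01', '2018-02', '2018-03', '2018-04'],
                    'value': [1, 2, 3, 4, 5, 6, 7]})
df2 = pd.DataFrame({'name': np.repeat(['Brandon', 'Erica'], [2, 2]),
                    'date': ['2019-02', '2019-03', '2018-01', '2018-04']})

# Same value_sum column as above, written as a per-group transform.
df1['value_sum'] = (df1.groupby('name')['value']
                       .transform(lambda s: s.shift(1).rolling(2).sum()))

# merge returns a new frame; assign it back to keep the result on df2.
df2 = df2.merge(df1[['name', 'date', 'value_sum']],
                how='left', on=['name', 'date'])
print(df2)
```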

Upvotes: 1
