alberto
alberto

Reputation: 63

Pandas SUMIF equivalent for two dataframes

I have a DataFrame(df1) that looks like this with a store and startdate/enddate for each one:

    df1 = pd.DataFrame(data={'store': ['X','Y','Z'], 'startdate': ['2020-02-03', '2020-03-05', '2020-04-01'], 'enddate': ['2020-03-05', '2020-05-02', '2020-06-07']})
df1

And I have a second DataFrame (df2) that looks like this and has invoice records for the different stores:

df2 = pd.DataFrame(data={'store': ['X','X','X','Y','Y'], 'invoicedate': ['2020-01-03','2020-02-05','2020-03-04', '2020-05-01', '2020-04-04'], 'sales': [153, 156, 12, 42, 48],})
df2

I want to add a column, we can call it df1['totalsales'], to the first DataFrame (df1) that sums up df2['sales'] when...

  1. df1['store']=df2['store']
  2. df2['invoicedate'] >= df1['startdate'] **&** df2['invoicedate'] <= df1['enddate']

Upvotes: 1

Views: 1500

Answers (2)

Scott Boston
Scott Boston

Reputation: 153460

Update easy fix:

df1.merge(df2, on='store').query('startdate <= invoicedate <= enddate')\
 .groupby(['store', 'startdate', 'enddate'])[['sales']].sum()\
 .reindex(pd.MultiIndex.from_frame(df1), fill_value=0)\
 .reset_index()

Output:

  store   startdate     enddate  sales
0     X  2020-02-03  2020-03-05    168
1     Y  2020-03-05  2020-05-02     90
2     Z  2020-04-01  2020-06-07      0

IIUC, let's use merge and query to filter results then groupby with sum:

df1.merge(df2, on='store').query('startdate <= invoicedate <= enddate')\
   .groupby(['store', 'startdate', 'enddate'])[['sales']].sum().reset_index()

Output:

  store   startdate     enddate  sales
0     X  2020-02-03  2020-03-05    168
1     Y  2020-03-05  2020-05-02     90

Upvotes: 2

kait
kait

Reputation: 1357

If you wanted to keep all of the rows in df1, you could use this:

def get_total_sales(x):
    mask = df2.store == x.store
    mask &= df2.invoicedate > x.startdate
    mask &= df2.invoicedate <= x.enddate
    x['total_sales'] = df2[mask].sales.sum()
    return x
df1.apply(lambda x: get_total_sales(x), axis=1)

Output:

store   startdate     enddate  total_sales
    X  2020-02-03  2020-03-05          168
    Y  2020-03-05  2020-05-02           90
    Z  2020-04-01  2020-06-07            0

Upvotes: 3

Related Questions