Reputation: 3378
I am new to Pandas and am having a hard time finding this sum. Suppose I have the following dataframes:
df1 = pd.DataFrame({'Product1': ['A', 'A', 'B'],
                    'Date1': ['2019-02-01', '2019-12-15', '2019-03-01']})
df1['Date1'] = pd.to_datetime(df1['Date1'])
df1
Product1 Date1
A 2019-02-01
A 2019-12-15
B 2019-03-01
and
df2 = pd.DataFrame({'Product2': ['A', 'A', 'A', 'C', 'B', 'B'],
                    'Date2': ['2019-01-01', '2019-02-01', '2019-06-01', '2019-06-30', '2019-01-31', '2019-02-15'],
                    'Payment': [100, 150, 100, 500, 25, 25]})
df2['Date2'] = pd.to_datetime(df2['Date2'])
df2
Product2 Date2 Payment
A 2019-01-01 100
A 2019-02-01 150
A 2019-06-01 100
C 2019-06-30 500
B 2019-01-31 25
B 2019-02-15 25
I would like to obtain the following result
Product1 Date1 Total
A 2019-02-01 250
A 2019-12-15 350
B 2019-03-01 50
where df1['Total'] is the sum of df2['Payment'] over the rows where df2['Product2'] == df1['Product1'] and df2['Date2'] <= df1['Date1'].
The best I have come up with so far uses a conditional loop:
sum_list = []
for col1, col2 in zip(df1['Product1'].values, df1['Date1'].values):
    cond = (df2['Product2'] == col1) & (df2['Date2'] <= col2)
    sum_list.append(df2[cond]['Payment'].sum())
df1['Total'] = sum_list
df1
but it is extremely slow when dealing with millions of rows. I believe there must be a better way.
Upvotes: 2
Views: 1345
Reputation: 38510
Here is an approach that mirrors the one in the OP. Rather than performing a one-to-many merge, it loops through each row of df1 and performs the check and sum there. It is likely a bit faster than the original since it uses a list comprehension rather than growing a list, though still slower than a merge. The upside is that it uses less memory.
df1['Payment'] = [
    df2.loc[(df1.iloc[i, 0] == df2.Product2) & (df1.iloc[i, 1] >= df2.Date2),
            'Payment'].sum()
    for i in range(df1.shape[0])
]
Note the >= — a strict > would silently drop payments made exactly on Date1 (row 0 would come out as 100 instead of 250). This returns the desired result
df1
0 A 2019-02-01 250
1 A 2019-12-15 350
2 B 2019-03-01 50
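If you stay with the row-wise approach, a small variant (a sketch, not part of the original) swaps the positional .iloc lookups for itertuples, which avoids repeated positional indexing on each iteration and reads a bit more clearly:

```python
import pandas as pd

df1 = pd.DataFrame({'Product1': ['A', 'A', 'B'],
                    'Date1': pd.to_datetime(['2019-02-01', '2019-12-15', '2019-03-01'])})
df2 = pd.DataFrame({'Product2': ['A', 'A', 'A', 'C', 'B', 'B'],
                    'Date2': pd.to_datetime(['2019-01-01', '2019-02-01', '2019-06-01',
                                             '2019-06-30', '2019-01-31', '2019-02-15']),
                    'Payment': [100, 150, 100, 500, 25, 25]})

# same row-wise idea: for each row of df1, sum the matching payments in df2
df1['Total'] = [
    df2.loc[(df2['Product2'] == row.Product1) & (df2['Date2'] <= row.Date1),
            'Payment'].sum()
    for row in df1.itertuples(index=False)
]
```

This is still O(len(df1) * len(df2)) like the original, so it only helps by a constant factor.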
Upvotes: 1
Reputation: 42916
Using merge, query, and GroupBy.sum:
mrg = df1.merge(df2, left_on='Product1', right_on='Product2')
mrg.query('Date2 <= Date1').groupby(['Product1', 'Date1'], as_index=False)['Payment'].sum()
Output
Product1 Date1 Payment
0 A 2019-02-01 250
1 A 2019-12-15 350
2 B 2019-03-01 50
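For millions of rows, the full one-to-many merge can itself get large. One way to avoid materializing it (a sketch of an alternative, not claimed by the answer above) is a per-product running total plus pd.merge_asof, which for each Date1 picks the latest cumulative value at or before it:

```python
import pandas as pd

df1 = pd.DataFrame({'Product1': ['A', 'A', 'B'],
                    'Date1': pd.to_datetime(['2019-02-01', '2019-12-15', '2019-03-01'])})
df2 = pd.DataFrame({'Product2': ['A', 'A', 'A', 'C', 'B', 'B'],
                    'Date2': pd.to_datetime(['2019-01-01', '2019-02-01', '2019-06-01',
                                             '2019-06-30', '2019-01-31', '2019-02-15']),
                    'Payment': [100, 150, 100, 500, 25, 25]})

# running total of payments per product, ordered by date
df2s = df2.sort_values('Date2').copy()
df2s['Total'] = df2s.groupby('Product2')['Payment'].cumsum()

# merge_asof (direction='backward' is inclusive) picks, per row of df1,
# the latest Date2 <= Date1 within the same product
out = pd.merge_asof(df1.sort_values('Date1'),
                    df2s[['Product2', 'Date2', 'Total']],
                    left_on='Date1', right_on='Date2',
                    left_by='Product1', right_by='Product2',
                    direction='backward')
out['Total'] = out['Total'].fillna(0)  # products with no prior payments
```

Both frames must be sorted on the merge key, but that sort is the only superlinear step.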
Upvotes: 2
Reputation: 6246
You can use a join up front and then filter out the rows according to your criteria. A solution could look like this:
import pandas as pd
df1 = pd.DataFrame({'Product1': ['A', 'A', 'B'],
                    'Date1': ['2019-02-01', '2019-12-15', '2019-03-01']})
df1['Date1'] = pd.to_datetime(df1['Date1'])
df2 = pd.DataFrame({'Product2': ['A', 'A', 'A', 'C', 'B', 'B'],
                    'Date2': ['2019-01-01', '2019-02-01', '2019-06-01', '2019-06-30', '2019-01-31', '2019-02-15'],
                    'Payment': [100, 150, 100, 500, 25, 25]})
df2['Date2'] = pd.to_datetime(df2['Date2'])
# join according to the products in both tables
temp = pd.merge(df1, df2, how='left', left_on='Product1', right_on='Product2')
# only keep the rows that meet the date criteria
temp = temp[temp['Date2'] <= temp['Date1']]
# aggregate results by product and date
out = temp.groupby(['Product1', 'Date1']).agg({'Payment': 'sum'})
# optional: turn the group keys back into regular columns
out = out.reset_index()
print(out.head())
Product1 Date1 Payment
0 A 2019-02-01 250
1 A 2019-12-15 350
2 B 2019-03-01 50
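One caveat with this filter: any row of df1 whose product has no qualifying payment drops out of the result entirely. If such rows should instead appear with a total of 0 (an assumption about the desired behavior, not stated in the question), the payments can be masked rather than the rows dropped:

```python
import pandas as pd

df1 = pd.DataFrame({'Product1': ['A', 'A', 'B'],
                    'Date1': pd.to_datetime(['2019-02-01', '2019-12-15', '2019-03-01'])})
df2 = pd.DataFrame({'Product2': ['A', 'A', 'A', 'C', 'B', 'B'],
                    'Date2': pd.to_datetime(['2019-01-01', '2019-02-01', '2019-06-01',
                                             '2019-06-30', '2019-01-31', '2019-02-15']),
                    'Payment': [100, 150, 100, 500, 25, 25]})

temp = pd.merge(df1, df2, how='left', left_on='Product1', right_on='Product2')
# zero out payments after the cutoff instead of dropping the rows, so every
# (Product1, Date1) pair in df1 survives the groupby, with total 0 if no match
temp['Payment'] = temp['Payment'].where(temp['Date2'] <= temp['Date1'], 0)
out = temp.groupby(['Product1', 'Date1'], as_index=False)['Payment'].sum()
```

With the sample data the result is identical to the answer above, since every product in df1 has at least one payment on or before its date.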
Upvotes: 1