I'm trying to create an aggregating stock function that loops over dates: for each date, I want to sum a variable per group. There can be multiple entries per day, however, which complicates the problem slightly. I'm currently looping with pandas.groupby, but this runs far too slowly on my >3M rows, so I'm wondering how I could vectorize it instead.
Here is example code for what I would like to vectorize. Imagine I am warehousing something, say apples, for a set of customers, and I'd like to keep track of how many apples I am holding for each customer on any given day. Each customer can make a 'deposit' of some apples from one day until a date in the future, and can make several deposits per day. The example code looks like this:
import pandas as pd
df = {'StartDate': ['2020-01-01','2020-01-01','2020-01-01','2020-01-02','2020-01-02','2020-01-02','2020-01-03','2020-01-04','2020-01-04','2020-01-05'],
      'EndDate': ['2020-01-02','2020-01-02','2020-01-05','2020-01-05','2020-01-03','2020-01-04','2020-01-06','2020-01-06','2020-01-06','2020-01-06'],
      'Apples': [5,6,2,4,4,10,8,9,3,7],
      'Customer': ['A','B','A','C','A','A','B','B','C','A']}
df = pd.DataFrame(data=df)
cycledates = ['2020-01-01','2020-01-02','2020-01-03','2020-01-04','2020-01-05','2020-01-06']
def stockfunction(indata, ingroupingvars, cycledates):
    stock_agg = pd.DataFrame()
    for i in cycledates:
        # deposits active on day i: started on or before i, not yet ended
        stock_input = indata.loc[
            (indata['StartDate'] <= i) & (indata['EndDate'] > i)].copy()  # .copy() avoids SettingWithCopyWarning
        stock_input['EvaluationDate'] = i
        stock_marg = stock_input.groupby(ingroupingvars)['Apples'].sum().unstack(level=0)
        # DataFrame.append was removed in pandas 2.0; use pd.concat instead
        stock_agg = pd.concat([stock_agg, stock_marg])
    return stock_agg

ingroupingvars = ['Customer', 'EvaluationDate']
stock = stockfunction(df, ingroupingvars, cycledates)
Sample input and output
The input dataframe looks like this:
StartDate EndDate Apples Customer
0 2020-01-01 2020-01-02 5 A
1 2020-01-01 2020-01-02 6 B
2 2020-01-01 2020-01-05 2 A
3 2020-01-02 2020-01-05 4 C
4 2020-01-02 2020-01-03 4 A
5 2020-01-02 2020-01-04 10 A
6 2020-01-03 2020-01-06 8 B
7 2020-01-04 2020-01-06 9 B
8 2020-01-04 2020-01-06 3 C
9 2020-01-05 2020-01-06 7 A
And the end-result looks like this:
Customer A B C
EvaluationDate
2020-01-01 7 6.0 NaN
2020-01-02 16 NaN 4.0
2020-01-03 12 8.0 4.0
2020-01-04 2 17.0 7.0
2020-01-05 7 17.0 3.0
...it's just that it takes forever to run on my original data.
Looks like you want to do a cross merge and query:
# convert to datetime type for safe comparison
Eval_df = pd.to_datetime(pd.Series(cycledates, name='EvaluationDate')).to_frame()
df['StartDate'] = pd.to_datetime(df['StartDate'])
df['EndDate'] = pd.to_datetime(df['EndDate'])

(df.assign(dummy=1)
   .merge(Eval_df.assign(dummy=1), on='dummy')   # cross join via a constant key
   .query('StartDate <= EvaluationDate < EndDate')
   .groupby(['EvaluationDate', 'Customer'])      # ingroupingvars
   ['Apples'].sum()
   .unstack('Customer', fill_value=0)            # remove fill_value=0 if you want the NaN
)
Output:
Customer A B C
EvaluationDate
2020-01-01 7 6 0
2020-01-02 16 0 4
2020-01-03 12 8 4
2020-01-04 2 17 7
2020-01-05 7 17 3
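On pandas 1.2 and newer, merge accepts how='cross', so the dummy column is unnecessary. A minimal sketch of the same pipeline under that assumption, reusing the df and Eval_df frames defined above:
# cross join each deposit with each evaluation date (requires pandas >= 1.2)
result = (df.merge(Eval_df, how='cross')
            .query('StartDate <= EvaluationDate < EndDate')
            .groupby(['EvaluationDate', 'Customer'])['Apples'].sum()
            .unstack('Customer', fill_value=0))
Either way, the cross join materializes len(df) * len(cycledates) rows before query filters them, so memory rather than CPU is the main constraint on very large inputs.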