gka

Reputation: 23

Python: How to most effectively vectorize this for-loop (which uses pandas groupby)

I'm trying to create an aggregating stock function that loops over dates. Basically, I want to sum a variable within groups, iterating over dates. I do have multiple entries for a given day, however, which complicates the problem slightly. I'm currently using pandas.groupby inside the loop, but this runs far too slowly on my >3m rows, so I'm wondering how I could vectorize it instead.

Imagine I am warehousing something, say apples, for a set of customers, and I'd like to keep track of how many apples I am storing for each customer on any given day. Each customer can make a 'deposit' of some apples from one day until a date in the future, and can make several deposits per day. The example code below represents what I'd like to vectorize:

import pandas as pd

df = {'StartDate': ['2020-01-01','2020-01-01','2020-01-01','2020-01-02','2020-01-02','2020-01-02','2020-01-03','2020-01-04','2020-01-04','2020-01-05'],
      'EndDate':['2020-01-02','2020-01-02','2020-01-05','2020-01-05','2020-01-03','2020-01-04','2020-01-06','2020-01-06','2020-01-06','2020-01-06'],
      'Apples':[5,6,2,4,4,10,8,9,3,7],
      'Customer':['A','B','A','C','A','A','B','B','C','A']}
df = pd.DataFrame(data=df)

cycledates = ['2020-01-01','2020-01-02','2020-01-03','2020-01-04','2020-01-05','2020-01-06']

def stockfunction(indata, ingroupingvars, cycledates):
    stock_agg = pd.DataFrame()
    for i in cycledates:
        # keep deposits in stock on day i: started on or before i,
        # and not yet withdrawn (EndDate strictly after i)
        stock_input = indata.loc[
            (indata['StartDate'] <= i) & (indata['EndDate'] > i)].copy()
        stock_input['EvaluationDate'] = i
        stock_marg = stock_input.groupby(ingroupingvars)['Apples'].sum().unstack(level=0)
        stock_agg = pd.concat([stock_agg, stock_marg])  # DataFrame.append is deprecated
    return stock_agg

ingroupingvars = ['Customer','EvaluationDate']
stock = stockfunction(df,ingroupingvars,cycledates)

Sample data and output

The input dataframe looks like this:

StartDate     EndDate  Apples Customer
0  2020-01-01  2020-01-02       5        A
1  2020-01-01  2020-01-02       6        B
2  2020-01-01  2020-01-05       2        A
3  2020-01-02  2020-01-05       4        C
4  2020-01-02  2020-01-03       4        A
5  2020-01-02  2020-01-04      10        A
6  2020-01-03  2020-01-06       8        B
7  2020-01-04  2020-01-06       9        B
8  2020-01-04  2020-01-06       3        C
9  2020-01-05  2020-01-06       7        A

And the end-result looks like this (2020-01-06 drops out because no deposit has an EndDate strictly after it):

Customer         A     B    C
EvaluationDate               
2020-01-01       7   6.0  NaN
2020-01-02      16   NaN  4.0
2020-01-03      12   8.0  4.0
2020-01-04       2  17.0  7.0
2020-01-05       7  17.0  3.0

...it's just that it takes forever to run on my original data.

Upvotes: 1

Views: 81

Answers (1)

Quang Hoang

Reputation: 150785

Looks like you want to do a cross merge and query:

# convert to datetime type for safe comparison
Eval_df = pd.to_datetime(pd.Series(cycledates, name='EvaluationDate')).to_frame()

df['StartDate'] = pd.to_datetime(df['StartDate'])
df['EndDate'] = pd.to_datetime(df['EndDate'])


(df.assign(dummy=1)                             # constant key for the cross join
   .merge(Eval_df.assign(dummy=1), on='dummy')  # every row paired with every date
   .query('StartDate<=EvaluationDate<EndDate')
   .groupby(['EvaluationDate', 'Customer'])      # ingroupingvars
   ['Apples'].sum()                              #
   .unstack('Customer', fill_value=0)            # remove fill_value=0 if you want the NaN
)

Output:

Customer         A   B  C
EvaluationDate           
2020-01-01       7   6  0
2020-01-02      16   0  4
2020-01-03      12   8  4
2020-01-04       2  17  7
2020-01-05       7  17  3
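
If you're on pandas 1.2 or newer, the dummy-column trick can be replaced with a built-in cross merge; the rest of the pipeline is unchanged:

(df.merge(Eval_df, how='cross')                 # pandas >= 1.2: row x date product
   .query('StartDate<=EvaluationDate<EndDate')
   .groupby(['EvaluationDate', 'Customer'])
   ['Apples'].sum()
   .unstack('Customer', fill_value=0)
)

Either way, the intermediate frame has len(df) * len(cycledates) rows, so on >3m rows with many dates, memory can become the bottleneck. If that product gets too big, here is a sketch of a leaner alternative (the names deltas and stock are just for illustration): treat each deposit as +Apples on its StartDate and -Apples on its EndDate, then take a cumulative sum per customer instead of building the full product:

# Sweep-line sketch: no cross product needed.
# Assumes StartDate/EndDate were already converted to datetime as above.
deltas = pd.concat([
    df[['StartDate', 'Customer', 'Apples']].rename(columns={'StartDate': 'Date'}),
    df[['EndDate', 'Customer', 'Apples']].rename(columns={'EndDate': 'Date'})
      .assign(Apples=lambda d: -d['Apples']),    # withdrawal on the end date
])
stock = (deltas.groupby(['Date', 'Customer'])['Apples'].sum()
               .unstack('Customer', fill_value=0)
               .cumsum()                          # running stock per customer
               .reindex(pd.to_datetime(cycledates), method='ffill'))

This reproduces the table above, except that 2020-01-06 shows up as an all-zero row (everything has been withdrawn by then) instead of being dropped by the query.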

Upvotes: 1
