desmond

Reputation: 2081

How to vectorise a Pandas calculation that is based on the last x rows of data

I have a fairly sophisticated prediction code with over 20 columns and millions of data points per column, using WLS. Currently I use iterrows to loop through dates, and then, based on those dates and the values on those dates, I extract different-sized slices of data for the calculation. It takes hours to run in production. I have simplified the code to the following:

import pandas as pd
import numpy as np
from datetime import timedelta

df=pd.DataFrame(np.random.randn(1000,2), columns=list('AB'))
df['dte'] = pd.date_range('9/1/2014', periods=1000, freq='D')

def calculateC(A, dte):
    if A > 0:  # the value determines the cutoff length for the trend prediction
        depth = 10
    else:
        depth = 20
    lastyear = dte - timedelta(days=365)
    df2 = df[df.dte < lastyear].head(depth)  # use last year's data as the basis of the prediction
    return df2.B.mean()  # the real model uses WLS; simplified to a mean here

for index, row in df.iterrows():
    if index>365:
        df.loc[index,'C']=calculateC(row.A, row.dte)

I have read that iterrows is the main culprit because it is not an efficient way to use Pandas, and that I should use vectorized methods instead. However, I can't seem to find a way to vectorize based on my conditions (dates, different slice lengths, and ranges of values). Is there a way?

Upvotes: 7

Views: 996

Answers (2)

JohnE

Reputation: 30424

I have good news and bad news. The good news is that I have a vectorized version that is about 300x faster; the bad news is that I can't quite replicate your results. Still, I think you ought to be able to use the principles here to greatly speed up your code, even if this code does not exactly replicate your results at the moment.

df['result'] = np.where( df['A'] > 0,
                         df.shift(365).rolling(10).B.mean(),
                         df.shift(365).rolling(20).B.mean() )

The tough (slow) part of your code is this:

df2=df[df.dte<lastyear].head(depth)

However, as long as your dates are evenly spaced one day apart (so that 365 rows correspond to one year), you can use code like this, which is vectorized and much faster:

df.shift(365).rolling(10).B.mean()

shift(365) replaces the df.dte &lt; lastyear filter, and rolling(...).mean() replaces head(depth) followed by mean(). It will be much faster and use less memory.

And actually, even if your dates aren't completely regular, you can probably resample to make this approach work. Or, somewhat equivalently, if you make the date your index, the shift can be made to work on a frequency rather than on rows (e.g. shift by 365 days, even if that is not 365 rows). It would probably be a good idea to make 'dte' your index here regardless.
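To make that last idea concrete, here is a minimal sketch using the question's own setup, with 'dte' as the index and a frequency-based shift (shift(freq='365D') plus a reindex) instead of a row-based shift(365). The seed and variable names (roll10, roll20) are mine, not from the original post; for perfectly daily data this reproduces the row-based version:

```python
import pandas as pd
import numpy as np

np.random.seed(0)
df = pd.DataFrame(np.random.randn(1000, 2), columns=list('AB'))
df['dte'] = pd.date_range('9/1/2014', periods=1000, freq='D')
df = df.set_index('dte')  # a DatetimeIndex lets shift work on a calendar frequency

# Compute the rolling means first, push them forward 365 calendar days,
# then reindex to realign them with the original dates (NaN where no history exists).
roll10 = df['B'].rolling(10).mean().shift(freq='365D').reindex(df.index)
roll20 = df['B'].rolling(20).mean().shift(freq='365D').reindex(df.index)
df['C'] = np.where(df['A'] > 0, roll10, roll20)
```

With irregular dates the same pattern applies, but the reindex would pick up only the dates that exist exactly 365 days later, so resampling to a daily grid first may be needed.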

Upvotes: 3

Chih-Hsu Jack Lin

Reputation: 453

I would try pandas.DataFrame.apply(func, axis=1)

def calculateC2(row):
    if row.name > 365:  # row.name is the index of the row
        if row.A > 0:  # the value determines the cutoff length for the trend prediction
            depth = 10
        else:
            depth = 20
        lastyear = row.dte - timedelta(days=365)
        df2 = df[df.dte < lastyear].B.head(depth)  # use last year's data as the basis of the prediction
        return np.mean(df2)  # the real model uses WLS; simplified to a mean here
    return np.nan

df['C'] = df.apply(calculateC2, axis=1)
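As a sanity check, here is a minimal sketch (using the question's setup, with a fixed seed I added) comparing the apply version against the original iterrows loop. Note that apply is tidier but still evaluates the function row by row, so it stays O(n) Python-level calls, unlike the fully vectorized approach:

```python
import pandas as pd
import numpy as np
from datetime import timedelta

np.random.seed(0)
df = pd.DataFrame(np.random.randn(1000, 2), columns=list('AB'))
df['dte'] = pd.date_range('9/1/2014', periods=1000, freq='D')

def calculateC2(row):
    if row.name > 365:
        depth = 10 if row.A > 0 else 20
        lastyear = row.dte - timedelta(days=365)
        return df[df.dte < lastyear].B.head(depth).mean()
    return np.nan

applied = df.apply(calculateC2, axis=1)

# The original iterrows loop, for comparison
looped = pd.Series(np.nan, index=df.index)
for index, row in df.iterrows():
    if index > 365:
        depth = 10 if row.A > 0 else 20
        looped[index] = df[df.dte < row.dte - timedelta(days=365)].B.head(depth).mean()
```

Both produce identical columns; the difference is only in readability and a modest constant-factor speedup.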

Upvotes: 0
