desmond
desmond

Reputation: 2081

How to use Pandas vector methods based on rolling custom function that involves entire row and prior data

While its easy to use pandas rolling method to apply standard formulas, but i find it hard if it involves multiple column with limited past rows. Using the following code to better elaborate: -

import numpy as np
import pandas as pd

#create dummy pandas
df=pd.DataFrame({'col1':np.arange(0,25),'col2':np.arange(100,125),'col3':np.nan})

def func1(shortdf):
    #dummy formula 
    #use last row of col1 multiply by sum of col2
    return (shortdf.col1.tail(1).values[0]+shortdf.col2.sum())*3.14

for idx, i in df.iterrows():
    if idx>3:
        #only interested in the last 3 rows from position of dataframe
        df.loc[idx,'col3']=func1(df.iloc[idx-3:idx])

I currently use this iterrow method which needless to say is extremely slow. can anyone has better suggestion?

Upvotes: 1

Views: 203

Answers (1)

bbd108
bbd108

Reputation: 998

Option 1

So shift is the solution here. You do have to use rolling for the summation, and then shift that series after the addition and multiplication.

df = pd.DataFrame({'col1':np.arange(0,25),'col2':np.arange(100,125),'col3':np.nan})

ans = ((df['col1'] + df['col2'].rolling(3).sum()) * 3.14).shift(1)

You can check to see that ans is the same as df['col3'] by using ans.eq(df['col3']). Once you see that all but the first few are the same, just change ans to df['col3'] and you should be all set.

Option 2

Without additional information about the customized weight function, it is hard to help. However, this option may be a solution as it separates the rolling calculation at the cost of using more memory.

# df['col3'] = ((df['col1'] + df['col2'].rolling(3).sum()) * 3.14).shift(1)

s = df['col2']

stride = pd.DataFrame([s.shift(x).values[::-1][:3] for x in range(len(s))[::-1]])
res = pd.concat([df, stride], axis=1)

# here you can perform your custom weight function
res['final'] = ((res[0] + res[1] + res[2] + res['col1']) * 3.14).shift(1)

stride is adapted from this question and the calculation is concatenated row-wise to the original dataframe. In this way each column has the value needed to compute whatever it is you may need.

res['final'] is identical to option 1's ans

Upvotes: 1

Related Questions