AwhatLoop
AwhatLoop

Reputation: 31

Python/Pandas For Loop Time Series

I am working with panel time-series data and am struggling with creating a fast for loop, to sum up, the past 50 numbers at the current i. The data is like 600k rows, and it starts to churn around 30k. Is there a way to use pandas or Numpy to do the same at a fraction of the time?

The change column is of type float, with 4 decimals.

Index     Change
0         0.0410
1         0.0000
2         0.1201
...     ... 
74327     0.0000
74328     0.0231
74329     0.0109
74330     0.0462
SEQ_LEN = 50
for i in range(SEQ_LEN, len(df)):
    df.at[i, 'Change_Sum'] = sum(df['Change'][i-SEQ_LEN:i])

Any help would be highly appreciated! Thank you!

Upvotes: 2

Views: 437

Answers (2)

Bill Huang
Bill Huang

Reputation: 4648

Disclaimer: This solution cannot compete with .rolling(). Plus, if a .groupby() case, just do a df.groupby("group")["Change"].rolling(50).sum() and then reset index. Therefore please accept the other answer.

Explicit for loop can be avoided by translating your recursive partial sum into the difference of cumulative sum (cumsum). The formula:

Sum[x-50:x] = Sum[:x] - Sum[:x-50] = Cumsum[x] - Cumsum[x-50]

Code

For showcase purpose, I have shorten len(df["Change"]) to 10 and SEQ_LEN to 5. A million records completed almost immediately in this way.

import pandas as pd
import numpy as np

# data
SEQ_LEN = 5
np.random.seed(111)  # reproducibility
df = pd.DataFrame(
    data={
        "Change": np.random.normal(0, 1, 10)  # a million rows
    }
)

# step 1. Do cumsum
df["Change_Cumsum"] = df["Change"].cumsum()

# Step 2. calculate diff of cumsum: Sum[x-50:x] = Sum[:x] - Sum[:x-50]
df["Change_Sum"] = np.nan  # or zero as you wish
df.loc[SEQ_LEN:, "Change_Sum"] = df["Change_Cumsum"].values[SEQ_LEN:] - df["Change_Cumsum"].values[:(-SEQ_LEN)]
# add idx=SEQ_LEN-1
df.at[SEQ_LEN-1, "Change_Sum"] = df.at[SEQ_LEN-1, "Change_Cumsum"]

Output

df
Out[30]: 
     Change  Change_Cumsum  Change_Sum
0 -1.133838      -1.133838         NaN
1  0.384319      -0.749519         NaN
2  1.496554       0.747035         NaN
3 -0.355382       0.391652         NaN
4 -0.787534      -0.395881   -0.395881
5 -0.459439      -0.855320    0.278518
6 -0.059169      -0.914489   -0.164970
7 -0.354174      -1.268662   -2.015697
8 -0.735523      -2.004185   -2.395838
9 -1.183940      -3.188125   -2.792244

Upvotes: 2

B. Bogart
B. Bogart

Reputation: 1075

I tried this with 600k rows and the average time was 20.9 ms ± 1.35 ms

This will return a series with the rolling sum for the last 50 Change in the df:

 df['Change'].rolling(50).sum()

you can add it to a new column like so:

df['change50'] = df['Change'].rolling(50).sum()

Upvotes: 3

Related Questions