Reputation: 49
I was able to find a couple of similar questions to this one, but I don't believe they answer what I am trying to accomplish.
Quite simply I am trying to use rolling.min with a variable window length from another column in the df. My dataset will probably get pretty large in the future so I am trying to pull this off without having to use a loop
pvdata['templow'] = pvdata['low'].rolling(window = pvdata['counter']).min()
Trying to run this I get: ValueError: window must be an integer 0 or greater, I believe the issue is that I am passing a series instead of an integer for the window length so my next attempt was to use apply lambda:
pvdata['templow'] = pvdata['counter'].apply(lambda x: pvdata['low'].rolling(window = x).min())
This returned the error: ValueError: Expected a 1D array, got an array with shape (10989, 10989), which I cannot explain.
Here is an example of the output I am looking for:
Is this possible to do without a loop?
Upvotes: 1
Views: 1030
Reputation: 3
Check this out. I created a variable size rolling statistics using a custom window indexer
import numpy as np
import pandas as pd
from pandas.api.indexers import BaseIndexer
class VariableWindowIndexer(BaseIndexer):
def __init__(self, window_size, max_periods=None):
super().__init__()
self.window_size = window_size.values
self.max_periods = max_periods
def get_window_bounds(self, num_values, min_periods, center, closed, step):
temp = np.arange(0, num_values, 1)
if self.max_periods is not None:
self.window_size = np.where(
self.window_size <= self.max_periods,
self.window_size,
self.max_periods
)
window_end = temp + 1
window_start = window_end - self.window_size
return window_start, window_end
@Stix, you can change window_size to your counter column
horizon = 1
lag = df["y"].shift(horizon)
window_size = df.index + 1
rolling = lag.rolling(
VariableWindowIndexer(window_size=window_size, max_periods=None),
min_periods=1,
center=False
)
df[["y_rolling_mean", "y_rolling_std"]] = rolling.agg(["mean", "std"])
t | y | lag | y_rolling_mean | y_rolling_std | |
---|---|---|---|---|---|
0 | 1 | 1 | NaN | NaN | NaN |
1 | 2 | 2 | 1.0 | 1.0 | NaN |
2 | 3 | 3 | 2.0 | 1.5 | 0.707107 |
3 | 4 | 4 | 3.0 | 2.0 | 1.000000 |
4 | 5 | 5 | 4.0 | 2.5 | 1.290994 |
5 | 6 | 6 | 5.0 | 3.0 | 1.581139 |
6 | 7 | 7 | 6.0 | 3.5 | 1.870829 |
7 | 8 | 8 | 7.0 | 4.0 | 2.160247 |
8 | 9 | 9 | 8.0 | 4.5 | 2.449490 |
9 | 10 | 10 | 9.0 | 5.0 | 2.738613 |
10 | 11 | 11 | 10.0 | 5.5 | 3.027650 |
11 | 12 | 12 | 11.0 | 6.0 | 3.316625 |
12 | 13 | 13 | 12.0 | 6.5 | 3.605551 |
13 | 14 | 14 | 13.0 | 7.0 | 3.894440 |
Upvotes: 0
Reputation: 93191
You cannot call rolling
with a variable window size. Depend on how many unique window sizes there are in the counter
column, it maybe faster to call rolling
with each size and combine the results.
n = 1_000_000
m = 10 # The number of unique values in `counter`
np.random.seed(42)
low = np.random.randint(1, 100, n)
counter = np.random.choice(np.arange(1, m + 1), n)
pvdata = pd.DataFrame({"low": low, "counter": counter})
codes, uniques = pd.factorize(pvdata["counter"], sort=True)
arr = np.vstack([pvdata["low"].rolling(u, min_periods=1).min() for u in uniques]).T
templow = arr[range(len(pvdata)), codes]
n = len(pvdata)
low = pvdata["low"].to_numpy()
counter = pvdata["counter"].to_numpy()
templow = np.repeat(np.nan, n)
for i, c in zip(range(n), counter):
lb = max(0, i - c + 1)
ub = i + 1
templow[i] = low[lb:ub].min()
As m
increases, the fast method will use more CPU and memory because:
rolling
more.rolling
calls. arr
requires n x m x 8
bytes of memory.The "fast" method becomes slower as m
increases. During my test, both methods are about equal at m = 150
, after which "fast" becomes the slower one. Experiment with your data and pick an algorithm that suits your needs.
Upvotes: 1