Pandas rolling with variable window length

Question

I was able to find a couple of similar questions to this one, but I don't believe they answer what I am trying to accomplish.

Quite simply I am trying to use rolling.min with a variable window length from another column in the df. My dataset will probably get pretty large in the future so I am trying to pull this off without having to use a loop

pvdata['templow'] = pvdata['low'].rolling(window = pvdata['counter']).min()

Trying to run this I get: ValueError: window must be an integer 0 or greater, I believe the issue is that I am passing a series instead of an integer for the window length so my next attempt was to use apply lambda:

pvdata['templow'] = pvdata['counter'].apply(lambda x: pvdata['low'].rolling(window = x).min())

This returned the error: ValueError: Expected a 1D array, got an array with shape (10989, 10989), which I cannot explain.

Here is an example of the output I am looking for:

Is this possible to do without a loop?

Code Different · Accepted Answer

You cannot call rolling with a variable window size. Depend on how many unique window sizes there are in the counter column, it maybe faster to call rolling with each size and combine the results.

Sample data

n = 1_000_000
m = 10 # The number of unique values in `counter`

np.random.seed(42)
low = np.random.randint(1, 100, n)
counter = np.random.choice(np.arange(1, m + 1), n)

pvdata = pd.DataFrame({"low": low, "counter": counter})

The fast method, but with limitations

codes, uniques = pd.factorize(pvdata["counter"], sort=True)
arr = np.vstack([pvdata["low"].rolling(u, min_periods=1).min() for u in uniques]).T
templow = arr[range(len(pvdata)), codes]

The slow but reliable method

n = len(pvdata)
low = pvdata["low"].to_numpy()
counter = pvdata["counter"].to_numpy()

templow = np.repeat(np.nan, n)
for i, c in zip(range(n), counter):
    lb = max(0, i - c + 1)
    ub = i + 1
    templow[i] = low[lb:ub].min()

Discussion

As m increases, the fast method will use more CPU and memory because:

It calls rolling more.
It must store the result of all those rolling calls. arr requires n x m x 8 bytes of memory.

The "fast" method becomes slower as m increases. During my test, both methods are about equal at m = 150, after which "fast" becomes the slower one. Experiment with your data and pick an algorithm that suits your needs.

	t	y	lag	y_rolling_mean	y_rolling_std
0	1	1	NaN	NaN	NaN
1	2	2	1.0	1.0	NaN
2	3	3	2.0	1.5	0.707107
3	4	4	3.0	2.0	1.000000
4	5	5	4.0	2.5	1.290994
5	6	6	5.0	3.0	1.581139
6	7	7	6.0	3.5	1.870829
7	8	8	7.0	4.0	2.160247
8	9	9	8.0	4.5	2.449490
9	10	10	9.0	5.0	2.738613
10	11	11	10.0	5.5	3.027650
11	12	12	11.0	6.0	3.316625
12	13	13	12.0	6.5	3.605551
13	14	14	13.0	7.0	3.894440

Pandas rolling with variable window length

Answers (2)

Sample data

The fast method, but with limitations

The slow but reliable method

Discussion

Related Questions