Reputation: 6263
I want to calculate the rolling weighted mean of a time series, with the average computed over a specific time interval. For example, this calculates the rolling mean with a 90-day window (not weighted):
import numpy as np
import pandas as pd
data = np.random.randint(0, 1000, (1000, 10))
index = pd.date_range("20190101", periods=1000, freq="18H")
df = pd.DataFrame(index=index, data=data)
df = df.rolling("90D").mean()
However, when I apply a weighting function (line below) I get an error: "ValueError: Invalid window 90D"
df = df.rolling("90D", win_type="gaussian").mean(std=60)
On the other hand, the weighted average works if I make the window an integer instead of an offset:
df = df.rolling(90, win_type="gaussian").mean(std=60)
Using an integer does not work for my application since the observations are not evenly spaced in time.
Two questions:
Can I do a weighted rolling mean with an offset (e.g. "90D" or "3M")?
If I can do a weighted rolling mean with an offset, then what does std refer to when I specify window="90D" and win_type="gaussian"; does it mean the std is 60D?
Upvotes: 1
Views: 511
Reputation: 392
Okay, I discovered that it's not implemented yet in pandas.
Look here: https://github.com/pandas-dev/pandas/blob/v0.25.0/pandas/core/window.py
If you follow line 2844 you see that when win_type is not None a Window object is returned:
if win_type is not None:
    return Window(obj, win_type=win_type, **kwds)
Then check the validate method of the Window object at line 630: it only allows integer or list-like windows.
I think this is because pandas uses the scipy.signal library to build the weights, and that library works with a fixed number of points in an array, so it cannot take into account how your observations are distributed over time.
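To illustrate the point about SciPy: its Gaussian window is defined by a number of points and a std measured in samples, with no notion of timestamps. That also suggests that, with an integer window, std=60 means 60 observations rather than 60 days. A small sketch, assuming SciPy is installed (the 5-point window and std=1.0 are illustrative values only):

```python
import numpy as np
from scipy.signal.windows import gaussian

# The window is described by a point count and a std expressed in samples.
# Nothing here knows how far apart in time the samples are.
weights = gaussian(5, std=1.0)

# Weighted mean of five consecutive observations, whatever their timestamps.
values = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
weighted_mean = np.sum(weights * values) / np.sum(weights)
print(weights)
print(weighted_mean)
```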
You could implement your own weighting function and pass it to rolling(...).apply, but its performance won't be too good.
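A minimal sketch of that workaround, under some assumptions of mine: the function name gaussian_time_weighted_mean and the std_days parameter are hypothetical, and the weights are a Gaussian of each observation's age relative to the newest observation in the window (not the centered window that win_type="gaussian" produces). Passing raw=False makes pandas hand the function a Series that still carries its timestamps:

```python
import numpy as np
import pandas as pd


def gaussian_time_weighted_mean(window: pd.Series, std_days: float = 60.0) -> float:
    """Weighted mean where weights decay with age, measured in days (hypothetical helper)."""
    # Age of each observation relative to the newest one in the window.
    age_days = np.asarray(
        (window.index[-1] - window.index) / pd.Timedelta(days=1), dtype=float
    )
    weights = np.exp(-0.5 * (age_days / std_days) ** 2)
    return float(np.sum(weights * window.to_numpy()) / np.sum(weights))


# Irregularly spaced sample data: an 18-hour grid with a quarter of the rows dropped.
rng = np.random.default_rng(0)
full_index = pd.date_range("2019-01-01", periods=200, freq=pd.Timedelta(hours=18))
keep = np.sort(rng.choice(200, size=150, replace=False))
df = pd.DataFrame(rng.integers(0, 1000, (150, 3)), index=full_index[keep])

# raw=False passes a Series (with its DatetimeIndex) instead of a bare ndarray.
weighted = df.rolling("90D").apply(
    gaussian_time_weighted_mean, raw=False, kwargs={"std_days": 60.0}
)
print(weighted.tail())
```

Because the function is called once per window and per column in Python, this is much slower than the built-in rolling mean, as noted above.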
Upvotes: 1
Reputation: 176
It is not clear to me what you want the weights in your weighted average to be. Is the weight a measure of the time for which an observation is 'in effect'?
If so, I believe you can re-index the dataframe so it has regularly spaced observations, then fill the NAs appropriately; see the method parameter in https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html
That will allow rolling with an integer window to work, and it will also help you think explicitly about how missing observations are treated: for instance, should a missing sample take its value from the last valid sample or from the nearest sample?
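A sketch of the re-indexing idea, assuming a daily grid and forward fill are acceptable for your data (both are my choices for illustration, as are the sample data and variable names). On a daily grid, an integer window of 90 rows spans 90 days and std=60 corresponds to 60 days. The win_type option needs SciPy installed:

```python
import numpy as np
import pandas as pd

# Irregularly spaced sample data: an 18-hour grid with a quarter of the rows dropped.
rng = np.random.default_rng(0)
full_index = pd.date_range("2019-01-01", periods=400, freq=pd.Timedelta(hours=18))
keep = np.sort(rng.choice(400, size=300, replace=False))
irregular = pd.DataFrame(rng.integers(0, 1000, (300, 3)), index=full_index[keep])

# Regular daily grid that starts on or after the first observation,
# so every grid point has an earlier sample to fill from.
daily_index = pd.date_range(
    irregular.index.min().ceil("D"), irregular.index.max().floor("D"), freq="D"
)

# method="ffill" takes the last valid sample; method="nearest" is the alternative.
regular = irregular.reindex(daily_index, method="ffill")

# With evenly spaced rows, the integer window and win_type are accepted.
smoothed = regular.rolling(90, win_type="gaussian").mean(std=60)
print(smoothed.dropna().head())
```

The first 89 rows of the result are NaN because a fixed-size window needs 90 rows by default; pass min_periods to rolling if you want partial windows.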
Upvotes: 0