Reputation: 229
I am trying to verify the ewm.std calculations of pandas so that I can implement a one step update for my code. Here is the complete description of the problem with code.
mrt = pd.Series(np.random.randn(1000))
N = 100
a = 2/(1+N)
bias = (2-a)/2/(1-a)
x = mrt.iloc[-2]
ma = mrt.ewm(span=N).mean().iloc[-3]
var = mrt.ewm(span=N).var().iloc[-3]
ans = mrt.ewm(span=N).std().iloc[-2]
print(np.sqrt( bias*(1-a) * (var + a * (x- ma)**2)), ans)
(1.1352524643949702, 1.1436193844674576)
I have used standard formulation. Could somebody tell me why the two values should not be same? i.e. how is pandas calculating the exponentially weighted std?
EDIT: After Julien's answer - let me give one more use case. I am plotting the ratio of the var calculated by pandas and using the formula I inferred from the Cython code of pandas ewm-covariance. This ratio should be 1. (I am guessing there is a problem with my formula, if somebody can point it out).
mrt = pd.Series(np.random.randn(1000))
N = 100
a = 2./(1+N)
bias = (2-a)/2./(1-a)
mewma = mrt.ewm(span=N).mean()
var_pandas = mrt.ewm(span=N).var()
var_calculated = bias * (1-a) * (var_pandas.shift(1) + a * (mrt-mewma.shift(1))**2)
(var_calculated/var_pandas).plot()
The plot shows the problem clearly.
EDIT 2: By trial and error, I figured out the right formula:
var_calculated = (1-a) * (var_pandas.shift(1) + bias * a * (mrt-mewma.shift(1))**2)
But I am not convinced that it should be the right one! Can somebody put light on that?
Upvotes: 6
Views: 16662
Reputation: 1883
Thank you @kosnik for the answer above. Copy-pasted your code below and build it up on it to answer the question here. Thus calculated exponential moving variance and standard deviation for entire dataset. The calculated values match with the output of .ewm()
.
# Import libraries
import numpy as np
import pandas as pd
# Create DataFrame
mrt = pd.Series(np.random.randn(1000))
df = pd.DataFrame(data=mrt, columns=['data'])
# Initialize
N = 3 # Span
a = 2./(1+N) # Alpha
# Use .evm() to calculate 'exponential moving variance' directly
var_pandas = df.ewm(span=N).var()
std_pandas = df.ewm(span=N).std()
# Initialize variable
varcalc=[]
stdcalc=[]
# Calculate exponential moving variance
for i in range(0,len(df.data)):
z = np.array(df.data.iloc[0:i+1].tolist())
# Get weights: w
n = len(z)
w = (1-a)**np.arange(n-1, -1, -1) # This is reverse order to match Series order
# Calculate exponential moving average
ewma = np.sum(w * z) / np.sum(w)
# Calculate bias
bias = np.sum(w)**2 / (np.sum(w)**2 - np.sum(w**2))
# Calculate exponential moving variance with bias
ewmvar = bias * np.sum(w * (z - ewma)**2) / np.sum(w)
# Calculate standard deviation
ewmstd = np.sqrt(ewmvar)
# Append
varcalc.append(ewmvar)
stdcalc.append(ewmstd)
#print('ewmvar:',ewmvar)
#varcalc
df['var_pandas'] = var_pandas
df['varcalc'] = varcalc
df['std_pandas'] = std_pandas
df['stdcalc'] = stdcalc
df
Upvotes: 3
Reputation: 2434
According to the documentation of the ewm
function, the default flag adjust=True
is used. As it is explained in the links below, the exponentially weighted moving values are not computed using the recursive relations, but instead using weights. This is justified, particularly for the cases when the length of the series is small.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.ewm.html https://github.com/pandas-dev/pandas/issues/8861
This means that the ewma
and ewmvar
are computed as normal weighted average and var with the weights being the exponentially decreasing factors
mrt_array = np.array(mrt.tolist())
M = len(mrt_array)
weights = (1-a)**np.arange(M-1, -1, -1) # This is reverse order to match Series order
ewma = sum(weights * mrt_array) / sum(weights)
bias = sum(weights)**2 / (sum(weights)**2 - sum(weights**2))
ewmvar = bias * sum(weights * (mrt_array - ewma)**2) / sum(weights)
ewmstd = np.sqrt(ewmvar)
Upvotes: 8
Reputation: 11905
Your question actually actually reduces to how pandas calculate ewm.var()
In [1]:
(np.sqrt(mrt.ewm(span=span).var()) == mrt.ewm(span=span).std())[1:].value_counts()
Out[1]:
True 999
dtype: int64
So in your example above: ans == np.sqrt(mrt.ewm(span=N).var().iloc[-2])
.
To investigate how it calculates ewmvar(), it does it by calling emcov with input_x=input_y=mrt
If we check for the first elements:
mrt.ewm(span=span).var()[:2].values
> array([nan, 0.00555309])
Now, using the emcov routine, and applying it to our specific case:
x0 = mrt.iloc[0]
x1 = mrt.iloc[1]
x2 = mrt.iloc[2]
# mean_x and mean_y are both the same, here we call it y
# This is the same as mrt.ewm(span=span).mean(), I verified that too
y0 = x0
# y1 = mrt.ewm(span=span).mean().iloc[1]
y1 = ((1-alpha)*y0 + x1)/(1+(1-alpha))
#y2 = (((1-alpha)**2+(1-alpha))*y1 + x2) / (1 + (1-alpha) + (1-alpha)**2)
cov0 = 0
cov1 = (((1-alpha) * (cov0 + ((y0 - y1)**2))) +
(1 * ((x1 - y1)**2))) / (1 + (1-alpha))
# new_wt = 1, sum_wt0 = (1-alpha), sum_wt2 = (1-alpha)**2
sum_wt = 1+(1-alpha)
sum_wt2 =1+(1-alpha)**2
numerator = sum_wt * sum_wt # (1+(1-alpha))^2 = 1 + 2(1-alpha) + (1-alpha)^2
denominator = numerator - sum_wt2 # # 2*(1-alpha)
print(np.nan,cov1*(numerator / denominator))
>(nan, 0.0055530905712123432)
Upvotes: 6