Reputation: 1
Right, so I'm a bit rusty with Python (picking it up again after four years) and was looking for a solution to this problem. While there are similar threads, I wasn't able to figure out what I'm doing wrong.
I have some data that looks like this:
print (fwds)
1y1yUSD 1y1yEUR 1y1yAUD 1y1yCAD 1y1yCHF 1y1yGBP \
Date
2019-10-15 1.47518 -0.503679 0.681473 1.84996 -0.804212 0.626394
2019-10-14 NaN -0.513647 0.684232 NaN -0.815201 0.643280
2019-10-11 1.51515 -0.520474 0.654544 1.84918 -0.812819 0.697584
2019-10-10 1.39085 -0.538651 0.564055 1.72812 -0.846291 0.546696
2019-10-09 1.30827 -0.568942 0.564897 1.63652 -0.896871 0.479307
... ... ... ... ... ... ...
1995-01-09 8.59473 NaN 10.830200 9.59729 NaN 9.407250
1995-01-06 8.58316 NaN 10.851200 9.42043 NaN 9.434480
1995-01-05 8.56470 NaN 10.839000 9.51209 NaN 9.560490
1995-01-04 8.44306 NaN 10.745900 9.51142 NaN 9.507650
1995-01-03 8.58847 NaN NaN 9.38380 NaN 9.611590
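If it helps to reproduce, here is a made-up stand-in with the same shape as fwds (the values are random and purely illustrative; only the trending behaviour and the scattered NaNs matter):
import numpy as np
import pandas as pd

# illustrative stand-in for fwds: trending random walks with ~5% NaNs sprinkled in
dates = pd.date_range('1995-01-03', '2019-10-15', freq='B', name='Date')[::-1]
cols = ['1y1yUSD', '1y1yEUR', '1y1yAUD', '1y1yCAD', '1y1yCHF', '1y1yGBP']
rng = np.random.default_rng(0)
fwds = pd.DataFrame(rng.normal(scale=0.05, size=(len(dates), len(cols))).cumsum(axis=0),
                    index=dates, columns=cols)
fwds = fwds.mask(rng.random(fwds.shape) < 0.05)   # knock out random values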
The problem is that the data quality is not great, and I need to remove outliers on a rolling basis (since these time series have been trending, a static z-score will not work).
I tried a few approaches. One was to compute a rolling z-score and then filter out the large values. However, when I calculate the z-score, the result is all NaNs:
def zscore(x, window):
    # rolling mean/std over the previous `window` rows, shifted by one so the
    # current observation is not included in its own statistics
    r = x.rolling(window=window)
    m = r.mean().shift(1)
    s = r.std(ddof=0).shift(1)  # note: skipna is not a valid argument for Rolling.std
    z = (x - m) / s
    return z
print (fwds)
print (zscore(fwds, 200))
1y1yUSD 1y1yEUR 1y1yAUD 1y1yCAD 1y1yCHF 1y1yGBP 1y1yJPY \
Date
2019-10-15 NaN NaN NaN NaN NaN NaN NaN
2019-10-14 NaN NaN NaN NaN NaN NaN NaN
2019-10-11 NaN NaN NaN NaN NaN NaN NaN
2019-10-10 NaN NaN NaN NaN NaN NaN NaN
2019-10-09 NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ...
1995-01-09 NaN NaN NaN NaN NaN NaN NaN
1995-01-06 NaN NaN NaN NaN NaN NaN NaN
1995-01-05 NaN NaN NaN NaN NaN NaN NaN
1995-01-04 NaN NaN NaN NaN NaN NaN NaN
1995-01-03 NaN NaN NaN NaN NaN NaN NaN
Another approach:
r = fwds.rolling(window=200)
large = r.mean() + 4 * r.std()
small = r.mean() - 4 * r.std()
print(fwds[fwds > large])
print(fwds[fwds < small])
returns:
1y1yUSD 1y1yEUR 1y1yAUD 1y1yCAD 1y1yCHF 1y1yGBP 1y1yJPY \
Date
2019-10-15 NaN NaN NaN NaN NaN NaN NaN
2019-10-14 NaN NaN NaN NaN NaN NaN NaN
2019-10-11 NaN NaN NaN NaN NaN NaN NaN
2019-10-10 NaN NaN NaN NaN NaN NaN NaN
2019-10-09 NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ...
1995-01-09 NaN NaN NaN NaN NaN NaN NaN
1995-01-06 NaN NaN NaN NaN NaN NaN NaN
1995-01-05 NaN NaN NaN NaN NaN NaN NaN
1995-01-04 NaN NaN NaN NaN NaN NaN NaN
1995-01-03 NaN NaN NaN NaN NaN NaN NaN
for both the upper and lower bounds as well. Does anyone have an idea how to deal with these darn NaNs when calculating rolling standard deviations or z-scores?
Any hints appreciated. Thanks!
Edit: For further clarity, I was hoping to systematically remove things like the spikes in the green and brown lines in the chart:
fwds.plot()
Link: https://i.sstatic.net/udu5O.png
Upvotes: 0
Views: 295
Reputation: 2919
Welcome to Stack Overflow. Depending on your use case (and how many crazy extreme values there are), data interpolation should fit the bill.
Since you're looking at forwards (I think), interpolation should be statistically sound unless some of your missing values are the result of massive disruption in the market.
You can use pandas' DataFrame.interpolate to fill in your NaN values with interpolated values.
From the docs:
Filling in NaN in a Series via linear interpolation.
>>> s = pd.Series([0, 1, np.nan, 3])
>>> s
0 0.0
1 1.0
2 NaN
3 3.0
dtype: float64
>>> s.interpolate()
0 0.0
1 1.0
2 2.0
3 3.0
dtype: float64
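Applied to your frame, a rough sketch might look something like this (the 200-row window and the 4-sigma cut-off are just example numbers; sort_index() is there because your frame is printed newest-first while a rolling window should look backward in time; limit_area='inside' keeps interpolate from extrapolating long leading stretches of NaN such as 1y1yEUR before it starts):
filled = fwds.sort_index().interpolate(method='linear', limit_area='inside')
r = filled.rolling(window=200)
z = (filled - r.mean().shift(1)) / r.std(ddof=0).shift(1)   # rolling z-score on the filled data
print(filled[z.abs() > 4].dropna(how='all'))                # candidate outliers per column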
Edit: I just realized you are looking for market dislocations, so you may not want to use linear interpolation, as that will mute the effect of the missing data.
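If you would rather not invent values at all, another option (a sketch, untested against your data) is to relax min_periods on the rolling window: pandas' rolling mean and std already ignore NaN values inside a window, but with the default min_periods equal to the window length a single NaN is enough to blank the whole statistic, which is likely why both of your attempts came back all NaN. The window length and thresholds here are arbitrary examples:
def rolling_zscore(x, window=200, min_periods=100):
    # min_periods < window lets windows that contain some NaNs still return a value
    r = x.rolling(window=window, min_periods=min_periods)
    return (x - r.mean().shift(1)) / r.std(ddof=0).shift(1)

z = rolling_zscore(fwds.sort_index())            # ascending dates so the window looks backward
cleaned = fwds.sort_index().mask(z.abs() > 4)    # blank out points more than 4 rolling sigmas away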
Upvotes: 0