oddlotmacro

Reputation: 1

Rolling stdev to remove outliers with NaNs

Right, so I'm a bit rusty with Python (pulling it out after four years) and was looking for a solution to this problem. While there were similar threads, I wasn't able to figure out what I'm doing wrong.

I have some data that looks like this:

print(fwds)
        1y1yUSD   1y1yEUR    1y1yAUD  1y1yCAD   1y1yCHF   1y1yGBP  \
Date                                                                    
2019-10-15  1.47518 -0.503679   0.681473  1.84996 -0.804212  0.626394   
2019-10-14      NaN -0.513647   0.684232      NaN -0.815201  0.643280   
2019-10-11  1.51515 -0.520474   0.654544  1.84918 -0.812819  0.697584   
2019-10-10  1.39085 -0.538651   0.564055  1.72812 -0.846291  0.546696   
2019-10-09  1.30827 -0.568942   0.564897  1.63652 -0.896871  0.479307   
...             ...       ...        ...      ...       ...       ...   
1995-01-09  8.59473       NaN  10.830200  9.59729       NaN  9.407250   
1995-01-06  8.58316       NaN  10.851200  9.42043       NaN  9.434480   
1995-01-05  8.56470       NaN  10.839000  9.51209       NaN  9.560490   
1995-01-04  8.44306       NaN  10.745900  9.51142       NaN  9.507650   
1995-01-03  8.58847       NaN        NaN  9.38380       NaN  9.611590   

The problem is that the data quality is not great, and I need to remove outliers on a rolling basis (these time series have been trending, so a static z-score won't work).
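To see why (a toy illustration with a hypothetical trending series, not my actual data): the trend itself dominates the deviation from the full-sample mean, so a static z-score flags the endpoints of the series rather than genuine spikes.

import numpy as np
import pandas as pd

s = pd.Series(np.arange(100, dtype=float))   # a steady trend with no bad data
static_z = (s - s.mean()) / s.std(ddof=0)
print(static_z.abs().idxmax())               # flags an endpoint (0), not a data error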

I tried a few solutions. One was to compute a rolling z-score and then filter out the large values. However, when I calculate the z-score, the result is all NaNs:

def zscore(x, window):
    r = x.rolling(window=window)
    m = r.mean().shift(1)
    s = r.std(ddof=0).shift(1)  # note: Rolling.std does not accept a skipna argument
    z = (x - m) / s
    return z

print(zscore(fwds, 200))
        1y1yUSD  1y1yEUR  1y1yAUD  1y1yCAD  1y1yCHF  1y1yGBP  1y1yJPY  \
Date                                                                        
2019-10-15      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
2019-10-14      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
2019-10-11      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
2019-10-10      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
2019-10-09      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
...             ...      ...      ...      ...      ...      ...      ...   
1995-01-09      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
1995-01-06      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
1995-01-05      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
1995-01-04      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
1995-01-03      NaN      NaN      NaN      NaN      NaN      NaN      NaN 

Another approach:

r = fwds.rolling(window=200)
large = r.mean() + 4 * r.std()
small = r.mean() - 4 * r.std()
print(fwds[fwds > large])
print(fwds[fwds < small])

returns:

   1y1yUSD  1y1yEUR  1y1yAUD  1y1yCAD  1y1yCHF  1y1yGBP  1y1yJPY  \
Date                                                                        
2019-10-15      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
2019-10-14      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
2019-10-11      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
2019-10-10      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
2019-10-09      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
...             ...      ...      ...      ...      ...      ...      ...   
1995-01-09      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
1995-01-06      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
1995-01-05      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
1995-01-04      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
1995-01-03      NaN      NaN      NaN      NaN      NaN      NaN      NaN 

for both the upper and lower bounds. Does anyone have an idea how to deal with these darn NaNs when calculating rolling stdevs or z-scores?

Any hints appreciated. Thanks!

Edit: For further clarity, I was hoping to systematically remove things like the spikes in the green and brown lines in this chart:

fwds.plot()

Link below: https://i.sstatic.net/udu5O.png

Upvotes: 0

Views: 295

Answers (1)

zglin

Reputation: 2919

Welcome to Stack Overflow. Depending on your use case (and how many extreme values there are), data interpolation should fit the bill.

Since you're looking at forwards (I think), interpolation should be statistically sound, unless some of your missing values are the result of a massive disruption in the market.

You can use pandas' DataFrame.interpolate to fill in your NaN values with interpolated values.

From the docs:

Filling in NaN in a Series via linear interpolation.

>>> s = pd.Series([0, 1, np.nan, 3])
>>> s
0    0.0
1    1.0
2    NaN
3    3.0
dtype: float64
>>> s.interpolate()
0    0.0
1    1.0
2    2.0
3    3.0
dtype: float64
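Applied to the data in the question, that would look something like this (a sketch assuming the fwds DataFrame from above; note its index is shown newest-first, so sort it ascending to interpolate along time):

# interpolate NaNs using the DatetimeIndex, weighting by the gap between dates
fwds_filled = fwds.sort_index().interpolate(method='time')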

Edit: I just realized you are looking for market dislocations, so you may not want to use linear interpolation, as that will mute the effect of the missing data.
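If you would rather keep the rolling z-score approach from the question, the all-NaN output is most likely pandas' min_periods default, which equals the window size: any NaN inside a 200-row window then blanks that window's mean and std. A minimal sketch of my own (not part of the original answer; the 4-stdev cutoff mirrors the question's second attempt):

def rolling_zscore(x, window, min_periods=20):
    # min_periods < window lets the rolling stats skip over NaNs
    # instead of returning NaN for the whole window
    r = x.rolling(window=window, min_periods=min_periods)
    m = r.mean().shift(1)          # shift(1): score each point against its past only
    s = r.std(ddof=0).shift(1)
    return (x - m) / s

fwds_sorted = fwds.sort_index()            # oldest first, so the window looks backwards
z = rolling_zscore(fwds_sorted, 200)
cleaned = fwds_sorted.mask(z.abs() > 4)    # points beyond 4 rolling stdevs become NaN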

Upvotes: 0
