Rolling stdev to remove outliers with NaNs

Question

Right, so I'm a bit rusty with python (pulling it out after 4yrs) and was looking for a solution to this problem. While there were similar threads i wasn't able to figure out what I'm doing wrong.

I have some data that looks like this:

print (fwds)
        1y1yUSD   1y1yEUR    1y1yAUD  1y1yCAD   1y1yCHF   1y1yGBP  \
Date                                                                    
2019-10-15  1.47518 -0.503679   0.681473  1.84996 -0.804212  0.626394   
2019-10-14      NaN -0.513647   0.684232      NaN -0.815201  0.643280   
2019-10-11  1.51515 -0.520474   0.654544  1.84918 -0.812819  0.697584   
2019-10-10  1.39085 -0.538651   0.564055  1.72812 -0.846291  0.546696   
2019-10-09  1.30827 -0.568942   0.564897  1.63652 -0.896871  0.479307   
...             ...       ...        ...      ...       ...       ...   
1995-01-09  8.59473       NaN  10.830200  9.59729       NaN  9.407250   
1995-01-06  8.58316       NaN  10.851200  9.42043       NaN  9.434480   
1995-01-05  8.56470       NaN  10.839000  9.51209       NaN  9.560490   
1995-01-04  8.44306       NaN  10.745900  9.51142       NaN  9.507650   
1995-01-03  8.58847       NaN        NaN  9.38380       NaN  9.611590

The problem is the data quality is not great and I need to remove outliers on a rolling basis (since these time series have been trending and using a static ZS will not work).

I tried a few solutions. One was to try and get a rolling zscore and then filter for the large ones. However, when I try calculating the zscore, my result is all NaNs:

def zscore(x, window):
    r = x.rolling(window=window)
    m = r.mean().shift(1)
    s = r.std(ddof=0, skipna=True).shift(1)
    z = (x-m)/s
    return z
print (fwds)
print (zscore(fwds, 200))
        1y1yUSD  1y1yEUR  1y1yAUD  1y1yCAD  1y1yCHF  1y1yGBP  1y1yJPY  \
Date                                                                        
2019-10-15      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
2019-10-14      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
2019-10-11      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
2019-10-10      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
2019-10-09      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
...             ...      ...      ...      ...      ...      ...      ...   
1995-01-09      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
1995-01-06      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
1995-01-05      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
1995-01-04      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
1995-01-03      NaN      NaN      NaN      NaN      NaN      NaN      NaN

Another approach:

r = fwds.rolling(window=200)
large = r.mean() + 4 * r.std()
small = r.mean() - 4 * r.std()
print(fwds[fwds > mps])
print (fwds[fwds < mps])

returns:

   1y1yUSD  1y1yEUR  1y1yAUD  1y1yCAD  1y1yCHF  1y1yGBP  1y1yJPY  \
Date                                                                        
2019-10-15      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
2019-10-14      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
2019-10-11      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
2019-10-10      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
2019-10-09      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
...             ...      ...      ...      ...      ...      ...      ...   
1995-01-09      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
1995-01-06      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
1995-01-05      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
1995-01-04      NaN      NaN      NaN      NaN      NaN      NaN      NaN   
1995-01-03      NaN      NaN      NaN      NaN      NaN      NaN      NaN

for both max and min as well. Anyone have any idea how to deal with these darn NaNs when calculating rolling stdev or zscores?

Any hints appreciated. Thanks!

Edit: For further clarity, I was hoping to remove things like the spike in the green and brown lines from the chart systematically:

fwds.plot()

Link below: https://i.sstatic.net/udu5O.png

Rolling stdev to remove outliers with NaNs

Answers (1)

Related Questions