Jack
Jack

Reputation: 41

Detecting outliers in a Pandas dataframe using a rolling standard deviation

I have a DataFrame for a fast Fourier transformed signal.

There is one column for the frequency in Hz and another column for the corresponding amplitude.

I have read a post made a couple of years ago, that you can use a simple boolean function to exclude or only include outliers in the final data frame that are above or below a few standard deviations.

df = pd.DataFrame({'Data':np.random.normal(size=200)})  # example dataset of normally distributed data. 
df[~(np.abs(df.Data-df.Data.mean())>(3*df.Data.std()))] # or if you prefer the other way around

The problem is that my signal drops several magnitudes (up to 10 000 times smaller) as frequency increases up to 50 000Hz. Therefore, I am unable to use a function that only exports values above 3 standard deviation because I will only pick up the "peaks" outliers from the first 50 Hz.

Is there a way I can export outliers in my dataframe that are above 3 rolling standard deviations of a rolling mean instead?

Upvotes: 4

Views: 5628

Answers (1)

Brad Solomon
Brad Solomon

Reputation: 40918

This is maybe best illustrated with a quick example. Basically you're comparing your existing data to a new column that is the rolling mean plus three standard deviations, also on a rolling basis.

import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({'Data':np.random.normal(size=200)})

# Create a few outliers (3 of them, at index locations 10, 55, 80)
df.iloc[[10, 55, 80]] = 40.    

r = df.rolling(window=20)  # Create a rolling object (no computation yet)
mps = r.mean() + 3. * r.std()  # Combine a mean and stdev on that object

print(df[df.Data > mps.Data])  # Boolean filter
#     Data
# 55  40.0
# 80  40.0

To add a new column filtering only to outliers, with NaN elsewhere:

df['Peaks'] = df['Data'].where(df.Data > mps.Data, np.nan)

print(df.iloc[50:60])
        Data  Peaks
50  -1.29409    NaN
51  -1.03879    NaN
52   1.74371    NaN
53  -0.79806    NaN
54   0.02968    NaN
55  40.00000   40.0
56   0.89071    NaN
57   1.75489    NaN
58   1.49564    NaN
59   1.06939    NaN

Here .where returns

An object of same shape as self and whose corresponding entries are from self where cond is True and otherwise are from other.

Upvotes: 9

Related Questions