espogian

Reputation: 607

Filter outliers in DataFrame rows based on a recurring time interval

I have the following DataFrame df:

ds                  y
2018-10-01 00:00    1.23
2018-10-01 01:00    2.21
2018-10-01 02:00    6.40
...                 ...
2018-10-02 00:00    3.21
2018-10-02 01:00    3.42
2018-10-03 02:00    2.99
...                 ...

That means I have one value of y per hour. I would like to filter the rows so that values outside the 6-sigma-wide interval around the mean (mean - 3*std, mean + 3*std) are dropped.

I'm able to do this for the entire DataFrame this way:

df = df[np.abs(df.y-df.y.mean()) <= (3*df.y.std())]

But I would like to do this on a per-day basis, using each day's own mean and standard deviation.

Please note that ds is a datetime64[ns] and y a float64.

Also, since my ultimate goal is to exclude outliers from data, can you suggest other viable options to accomplish this?

Upvotes: 1

Views: 92

Answers (1)

Scott Boston

Reputation: 153510

Try this:

# Assumes ds is the DatetimeIndex; if ds is a column, group on df['ds'].dt.floor('D') instead.
g = df.groupby(df.index.floor('D'))['y']
df[(np.abs(df.y - g.transform('mean')) <= (3*g.transform('std')))]
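
For anyone who wants to try this end to end, here is a minimal sketch on made-up data. It assumes ds is a regular column, as in the question, rather than the index. The synthetic values and the names day, daily_sigma, global_sigma and daily_iqr are only illustrative. The last part swaps in a per-day IQR (Tukey fence) rule as one possible alternative; that is not part of the snippet above, just a commonly used option that is less sensitive to the outliers themselves.

```python
import numpy as np
import pandas as pd

# Synthetic hourly data (an assumption for illustration): two days, one spike each.
ds = pd.date_range("2018-10-01", periods=48, freq=pd.Timedelta(hours=1))
y = 2.0 + 0.1 * np.sin(np.arange(48))
y[5] = 50.0   # large spike on day 1
y[30] = 8.0   # smaller spike on day 2
df = pd.DataFrame({"ds": ds, "y": y})

# Group by calendar day; ds is a column here, so use the .dt accessor.
day = df["ds"].dt.floor("D")
g = df.groupby(day)["y"]

# Per-day 3-sigma rule: compare each value with its own day's mean and std.
keep_sigma = (df["y"] - g.transform("mean")).abs() <= 3 * g.transform("std")
daily_sigma = df[keep_sigma]

# Whole-frame 3-sigma rule, for comparison with the per-day version.
keep_global = (df["y"] - df["y"].mean()).abs() <= 3 * df["y"].std()
global_sigma = df[keep_global]

# Alternative (not from the answer above): per-day IQR fences.
q1 = g.transform(lambda s: s.quantile(0.25))
q3 = g.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1
keep_iqr = df["y"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
daily_iqr = df[keep_iqr]

print(len(df), len(global_sigma), len(daily_sigma), len(daily_iqr))
```

On this data the whole-frame filter keeps the smaller spike, because the large one inflates the overall standard deviation, while the per-day filter drops both. Note that with only 24 points per day a single extreme value also inflates that day's std, which is one reason a quantile-based rule may be worth considering.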

Upvotes: 0
