Reputation: 607
I have the following DataFrame df:
ds y
2018-10-01 00:00 1.23
2018-10-01 01:00 2.21
2018-10-01 02:00 6.40
... ...
2018-10-02 00:00 3.21
2018-10-02 01:00 3.42
2018-10-03 02:00 2.99
... ...
That is, I have one value of y per hour.
I would like to drop the rows whose y values fall outside the 3-sigma band around the mean, i.e. outside the interval (mean - 3*std, mean + 3*std).
I'm able to do this for the entire DataFrame this way:
df = df[np.abs(df.y-df.y.mean()) <= (3*df.y.std())]
But I would like to do this on a per-day basis.
Please note that ds is a datetime64[ns] and y is a float64.
Also, since my ultimate goal is to exclude outliers from data, can you suggest other viable options to accomplish this?
Upvotes: 1
Views: 92
Reputation: 153510
Try this:
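# Group y by calendar day (use df.index.floor('D') instead if ds is the index)
g = df.groupby(df['ds'].dt.floor('D'))['y']
# Keep rows within 3 standard deviations of their own day's mean
df = df[np.abs(df['y'] - g.transform('mean')) <= 3 * g.transform('std')]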
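For reference, here is a minimal self-contained sketch of the per-day filter. It assumes ds is a regular column, as described in the question. The sample data (two days of hourly values with one injected spike) and the names day_mean, day_std and filtered are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two days of hourly values with one injected spike.
ds = pd.date_range("2018-10-01", periods=48, freq=pd.Timedelta(hours=1))
y = np.tile(np.linspace(1.0, 3.0, 24), 2)
y[5] = 100.0  # outlier at 2018-10-01 05:00
df = pd.DataFrame({"ds": ds, "y": y})

# Per-day mean and standard deviation, broadcast back to every row.
g = df.groupby(df["ds"].dt.floor("D"))["y"]
day_mean = g.transform("mean")
day_std = g.transform("std")

# Keep rows within 3 standard deviations of their own day's mean.
filtered = df[(df["y"] - day_mean).abs() <= 3 * day_std]
print(len(df), len(filtered))
```

One thing to watch: a day with only a single row has a NaN standard deviation, so the comparison is False and that row gets dropped.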
Upvotes: 0