Reputation: 2750
I have the data frame below, which holds time-series data. I process this information to produce the input for my prediction models.
import pandas as pd

df = pd.DataFrame({"timestamp": [pd.Timestamp('2019-01-01 01:00:00', tz=None),
                                 pd.Timestamp('2019-01-01 01:00:00', tz=None),
                                 pd.Timestamp('2019-01-01 01:00:00', tz=None),
                                 pd.Timestamp('2019-01-01 02:00:00', tz=None),
                                 pd.Timestamp('2019-01-01 02:00:00', tz=None),
                                 pd.Timestamp('2019-01-01 02:00:00', tz=None),
                                 pd.Timestamp('2019-01-01 03:00:00', tz=None),
                                 pd.Timestamp('2019-01-01 03:00:00', tz=None),
                                 pd.Timestamp('2019-01-01 03:00:00', tz=None)],
                   "value": [5.4, 5.1, 100.8, 20.12, 21.5, 80.08, 150.09, 160.12, 20.06]
                   })
From this, I take the mean of the value for each timestamp and send that mean as the input to the predictor. Currently I am using fixed thresholds to filter out the outliers, but those filter out some real values and also miss some outliers.
For example, I used
df[(df['value'] > 3) & (df['value'] < 120)]
and this does not filter out
2019-01-01 01:00:00 100.8
which is an outlier for that timestamp and does filter out
2019-01-01 03:00:00 150.09
2019-01-01 03:00:00 160.12
which are not outliers for that timestamp.
So how do I filter out outliers per timestamp, based on which values do not fit the rest of that timestamp's group?
Any help is appreciated.
Upvotes: 1
Views: 344
Reputation: 6270
OK, let's assume you want to use a confidence interval to detect outliers.
Then you need the mean and the confidence interval for each timestamp group. To get them, you can run:
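import math

# Per-timestamp mean, count and sample standard deviation
stats = df.groupby(['timestamp'])['value'].agg(['mean', 'count', 'std'])
ci95_hi = []
ci95_lo = []
for i in stats.index:
    m, c, s = stats.loc[i]
    # 95% interval around the group mean: mean +/- 1.96 * std / sqrt(count)
    ci95_hi.append(m + 1.96*s/math.sqrt(c))
    ci95_lo.append(m - 1.96*s/math.sqrt(c))
stats['ci95_hi'] = ci95_hi
stats['ci95_lo'] = ci95_lo
# Attach the group statistics to every row of the original frame
df = pd.merge(df, stats, how='left', on='timestamp')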
After the merge, each row carries the mean, count, std, ci95_hi and ci95_lo of its timestamp group.
Then you can add a filter column:
import numpy as np
df['Outlier'] = np.where(df['value'] >= df['ci95_hi'], 1, np.where(df['value']<= df['ci95_lo'], 1, 0))
Then every row with a 1 in the Outlier column is an outlier. You can adjust the 1.96 multiplier to make the filter stricter or looser.
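For reference, the same per-timestamp check can probably be written without the explicit loop and merge by using groupby(...).transform. This is only a sketch under the same assumptions as above (same 1.96 multiplier, sample standard deviation); the names Z, half_width and cleaned_means are illustrative and not part of the original code. It also shows the step the question asks for, taking the mean per timestamp after the flagged rows are dropped.

```python
import numpy as np
import pandas as pd

# Same sample data as in the question
df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2019-01-01 01:00:00"] * 3
        + ["2019-01-01 02:00:00"] * 3
        + ["2019-01-01 03:00:00"] * 3
    ),
    "value": [5.4, 5.1, 100.8, 20.12, 21.5, 80.08, 150.09, 160.12, 20.06],
})

Z = 1.96  # multiplier from the answer; hypothetical name, tune as needed

# transform() returns one value per row, aligned with df, so no merge is needed
grouped = df.groupby("timestamp")["value"]
mean = grouped.transform("mean")
std = grouped.transform("std")      # sample standard deviation (ddof=1)
count = grouped.transform("count")

# Half-width of the interval around each group's mean
half_width = Z * std / np.sqrt(count)

# 1 if the value falls on or outside the interval of its own group, else 0
df["Outlier"] = (
    (df["value"] >= mean + half_width) | (df["value"] <= mean - half_width)
).astype(int)

# Mean per timestamp using only the rows that were not flagged
cleaned_means = df[df["Outlier"] == 0].groupby("timestamp")["value"].mean()
print(df)
print(cleaned_means)
```

On the sample data this flags 100.8, 80.08 and 20.06. Note that a timestamp with a single row has an undefined standard deviation (NaN), so that row is never flagged; whether that is the desired behaviour depends on your data.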
Upvotes: 1