Ricky
Ricky

Reputation: 2750

How to remove outliers specific to each timestamp?

I am having the below data frame which is a time-series data and I process this information to input to my prediction models.

df = pd.DataFrame({"timestamp": [pd.Timestamp('2019-01-01 01:00:00', tz=None),
                               pd.Timestamp('2019-01-01 01:00:00', tz=None),
                               pd.Timestamp('2019-01-01 01:00:00', tz=None),
                               pd.Timestamp('2019-01-01 02:00:00', tz=None),
                               pd.Timestamp('2019-01-01 02:00:00', tz=None),
                               pd.Timestamp('2019-01-01 02:00:00', tz=None),
                               pd.Timestamp('2019-01-01 03:00:00', tz=None),
                               pd.Timestamp('2019-01-01 03:00:00', tz=None),
                               pd.Timestamp('2019-01-01 03:00:00', tz=None)],
                   "value":[5.4,5.1,100.8,20.12,21.5,80.08,150.09,160.12,20.06]

                  })

From this, I take the mean of the value for each timestamp and will send the value as the input to the predictor. But currently, I am using just thresholds to filter out the outliers,but those seem to filter out real vales and also not filter some outliers .

For example, I kept

df[(df['value']>3 )& (df['value']<120 )]

and then this does not filter out

2019-01-01 01:00:00 100.8

which is an outlier for that timestamp and does filter out

2019-01-01 03:00:00 150.09
2019-01-01 03:00:00 160.12

which are not outliers for that timestamp.

So how do I filter out outliers for each timestamp based on which does not fit that group?

Any help is appreciated.

Upvotes: 1

Views: 344

Answers (1)

PV8
PV8

Reputation: 6270

Ok, let's assume you are searching for the confidence interval to detect outlier.

Then you have to get the mean and the confidence intervals for each timestamp group. Therefore you can run:

stats = df.groupby(['timestamp'])['value'].agg(['mean', 'count', 'std'])
ci95_hi = []
ci95_lo = []
import math
for i in stats.index:
    m, c, s = stats.loc[i]
    ci95_hi.append(m + 1.96*s/math.sqrt(c))
    ci95_lo.append(m - 1.96*s/math.sqrt(c))

stats['ci95_hi'] = ci95_hi
stats['ci95_lo'] = ci95_lo
df = pd.merge(df, stats, how='left', on='timestamp')

which leads to the following output: enter image description here

then you can adjust a filter column:

import numpy as np
df['Outlier'] = np.where(df['value'] >= df['ci95_hi'], 1, np.where(df['value']<= df['ci95_lo'], 1, 0))

then everythign with a 1 in the column outlier is an outlier. You can adjust the values with 1.96 to play a little with it.

The outcome looks like: enter image description here

Upvotes: 1

Related Questions