Reputation: 585
I have a dataframe with price column (p) and I have some undesired values like (0, 1.50, 92.80, 0.80). Before I calculate the mean of the price by product code, I would like to remove these outliers
Code Year Month Day Q P
0 100 2017 1 4 2.0 42.90
1 100 2017 1 9 2.0 42.90
2 100 2017 1 18 1.0 45.05
3 100 2017 1 19 2.0 45.05
4 100 2017 1 20 1.0 45.05
5 100 2017 1 24 10.0 46.40
6 100 2017 1 26 1.0 46.40
7 100 2017 1 28 2.0 92.80
8 100 2017 2 1 0.0 0.00
9 100 2017 2 7 2.0 1.50
10 100 2017 2 8 5.0 0.80
11 100 2017 2 9 1.0 45.05
12 100 2017 2 11 1.0 1.50
13 100 2017 3 8 1.0 49.90
14 100 2017 3 17 6.0 45.05
15 100 2017 3 24 1.0 45.05
16 100 2017 3 30 2.0 1.50
How would be a good way to filter the outliers for each product (group by code) ?
I tried this:
stds = 1.0 # Number of standard deviation that defines 'outlier'.
z = df[['Code','P']].groupby('Code').transform(
lambda group: (group - group.mean()).div(group.std()))
outliers = z.abs() > stds
df[outliers.any(axis=1)]
And then :
print(df[['Code', 'Year', 'Month','P']].groupby(['Code', 'Year', 'Month']).mean())
But the outlier filter doesn`t work properly.
Upvotes: 1
Views: 613
Reputation: 164783
You have the right idea. Just take the Boolean opposite of your outliers['P']
series via ~
and filter your dataframe via loc
:
res = df.loc[~outliers['P']]\
.groupby(['Code', 'Year', 'Month'], as_index=False)['P'].mean()
print(res)
Code Year Month P
0 100 2017 1 44.821429
1 100 2017 2 45.050000
2 100 2017 3 46.666667
Upvotes: 1
Reputation: 51405
IIUC You can use a groupby on Code
, do your z
score calculation on P
, and filter if the z
score is greater than your threshold:
stds = 1.0
filtered_ df = df[~df.groupby('Code')['P'].transform(lambda x: abs((x-x.mean()) / x.std()) > stds)]
Code Year Month Day Q P
0 100 2017 1 4 2.0 42.90
1 100 2017 1 9 2.0 42.90
2 100 2017 1 18 1.0 45.05
3 100 2017 1 19 2.0 45.05
4 100 2017 1 20 1.0 45.05
5 100 2017 1 24 10.0 46.40
6 100 2017 1 26 1.0 46.40
11 100 2017 2 9 1.0 45.05
13 100 2017 3 8 1.0 49.90
14 100 2017 3 17 6.0 45.05
15 100 2017 3 24 1.0 45.05
filtered_df[['Code', 'Year', 'Month','P']].groupby(['Code', 'Year', 'Month']).mean()
P
Code Year Month
100 2017 1 44.821429
2 45.050000
3 46.666667
Upvotes: 2