Reputation: 189
I have a Pandas dataframe that I am trying to remove outliers from on a group-by-group basis. A row in a group is considered an outlier if its value in a given column falls outside the range
[group_mean - (group_std_dev * 3), group_mean + (group_std_dev * 3)]
where group_mean is the average value of the column in the group, and group_std_dev is the standard deviation of the column for the group. I tried the following Pandas chain:
df.groupby(by='group').apply(lambda x: x[(x['col'].mean() - (x['col'].std() * 3)) < x['col'] < (x['col'].mean() + (x['col'].std() * 3))])
but it does not appear to work, as Pandas throws the following error for the comparison inside apply:
The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
The error does not appear to make much sense to me because the comparison should convert to a Series of bools, which then is applied to the group x?
However, filtering by just the upper or the lower bound does work, for example:
df.groupby(by='group').apply(lambda x: x[(x['col'].mean() - (x['col'].std() * 3)) < x['col']])
but I am unsure of how to chain these together.
Does anyone have any ideas on how to simply & cleanly implement this? It doesn't appear very hard to me, but other posts on here have not yielded a satisfactory or working answer.
Upvotes: 1
Views: 2232
Reputation: 30930
Use GroupBy.transform and Series.between; this is faster:
groups = df.groupby('group')['col']
groups_mean = groups.transform('mean')
groups_std = groups.transform('std')
m = df['col'].between(groups_mean.sub(groups_std.mul(3)),
                      groups_mean.add(groups_std.mul(3)),
                      inclusive='neither')  # pandas >= 1.3; older versions used inclusive=False
print(m)
new_df = df.loc[m]
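To try this end to end, here is a minimal self-contained sketch. The sample data is made up for illustration (group "a" contains one extreme value, group "b" contains none), and it assumes pandas 1.3 or later for inclusive="neither".

```python
import pandas as pd

# Hypothetical sample data, not from the question.
df = pd.DataFrame({
    "group": ["a"] * 12 + ["b"] * 5,
    "col": [10.0] * 11 + [1000.0] + [1.0, 2.0, 3.0, 4.0, 5.0],
})

groups = df.groupby("group")["col"]
groups_mean = groups.transform("mean")  # per-group mean, aligned to df's index
groups_std = groups.transform("std")    # per-group sample std (ddof=1)

lower = groups_mean - 3 * groups_std
upper = groups_mean + 3 * groups_std

# Strictly inside (lower, upper); both bounds are excluded.
m = df["col"].between(lower, upper, inclusive="neither")
new_df = df.loc[m]

print(new_df)
```

Because transform returns a Series aligned to the original index, the mask can be used directly with df.loc and no per-group Python function is called.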
When should I want to use apply
Your code with apply could be:
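# mean() and std() return scalars here, so combine them with + and -; group_keys=False keeps the original index on the mask
df.groupby(by='group', group_keys=False)['col'].apply(lambda x: x.lt(x.mean() + x.std() * 3) & x.gt(x.mean() - x.std() * 3))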
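As for why the original chained comparison raised the error: Python evaluates a < s < b as (a < s) and (s < b), and the and operator asks the first Series for a single truth value. The small sketch below, using made-up values, reproduces the failure and shows the element-wise alternative with &.

```python
import pandas as pd

# Hypothetical values for illustration only.
s = pd.Series([1.0, 2.0, 3.0, 100.0])
lower, upper = 0.0, 50.0

# `lower < s < upper` expands to `(lower < s) and (s < upper)`.
# `and` needs one True/False from the first operand, which a Series
# cannot provide, so pandas raises ValueError.
try:
    lower < s < upper
    chained_ok = True
except ValueError:
    chained_ok = False

# Element-wise AND with `&`; the parentheses are required because
# `&` binds more tightly than `<`.
mask = (lower < s) & (s < upper)

print(chained_ok)
print(mask.tolist())
```

Filtering on one bound works because a single comparison already yields a boolean Series; only the implicit and in the chained form triggers the ambiguity.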
Upvotes: 1