sdom
sdom

Reputation: 319

How to remove outliers from groups based on percentile

I have a table df like this, but longer and with many other type values.

type weight
a 35.1
a 36.7
b 100.2
b 99.3
b 102.0
b 5.0
a 38.2
a 250.8

I want to remove from df all records with outliers using the 95th percentile but broken down into individual values ​​in the type column.

For a single value of type, I do it like this:

my_perc = 95
temp = df[df['type'] == 'a']
temp[temp.weight < np.percentile(temp.weight, my_perc)]

Now I would like to do this automatically for the whole table df, taking into account individual groups in the type column.

I also tried this:

df[df.groupby(['type'])['weight'] < np.percentile(df.weight, my_perc)]

But it doesn't work.

Do you have any idea for this?

Upvotes: 4

Views: 539

Answers (2)

ThePyGuy
ThePyGuy

Reputation: 18456

You can use DataFrame.groupby and group the dataframe on type then calculate the percentile using np.percentile for weight column for each of the group to get the required result:

df.groupby('type')['weight'].apply(lambda x:x[x.le(np.percentile(x, 95))]).droplevel(axis=0, level=1).reset_index()
  type  weight
0    a    35.1
1    a    36.7
2    a    38.2
3    b   100.2
4    b    99.3
5    b     5.0

Upvotes: 0

sdom
sdom

Reputation: 319

Ok, probably problem solved:

my_perc = 0.95
df[df.groupby('type')['weight'].transform(lambda x : x < x.quantile(my_perc))]

Upvotes: 3

Related Questions