Gara
Gara

Reputation: 47

Detecting outliers within one column for ranges of rows

In given data frame I have these two columns:

 neighbourhood_group
 price

Price column contains all the prices for all neighbourhood_group:

    neighbourhood_group price
 0  Brooklyn            149
 1  Manhattan           225
 2  Manhattan           150
 3  Brooklyn            89
 4  Manhattan           80
 5  Manhattan           200
 6  Brooklyn            60
 7  Manhattan           79
 8  Manhattan           79
 9  Manhattan           150

I am trying to detect outliers withing each neighbourhood_group.

The only idea I have come up to so far is to group by prices by neighbourhood_group, detect outliers within each group and create a mask for rows that needs to be dropped.

 data.groupby('neighbourhood_group')['price']

I suspect there might be an easier solution for that.

Upvotes: 1

Views: 117

Answers (3)

adhg
adhg

Reputation: 10903

I'll do it a bit manually:

let's assume your df is this (note I added 2 lines at the bottom)

    neighbourhood_group price
0   Brooklyn    149
1   Manhattan   225
2   Manhattan   150
3   Brooklyn    89
4   Manhattan   80
5   Manhattan   200
6   Brooklyn    60
7   Manhattan   79
8   Manhattan   79
9   Manhattan   150
10  Manhattan   28
11  Manhattan   280

let's add 2 column to facilitate here:

df['mean']=df.groupby('neighbourhood_group').transform('mean')
df['std'] = df.groupby('neighbourhood_group')['price'].transform('std')

let's ask for true/false if is_outlier

df['is_outlier'] = df.apply(lambda x: x['price']+x['std']<x['mean'] or x['price']-x['std']>x['mean'], axis=1)

result

    neighbourhood_group price   mean              std   is_outlier
0   Brooklyn            149     99.333333   45.390895   True
1   Manhattan           225     141.222222  82.308532   True
2   Manhattan           150     141.222222  82.308532   False
3   Brooklyn            89      99.333333   45.390895   False
4   Manhattan           80      141.222222  82.308532   False
5   Manhattan           200     141.222222  82.308532   False
6   Brooklyn            60      99.333333   45.390895   False
7   Manhattan           79      141.222222  82.308532   False
8   Manhattan           79      141.222222  82.308532   False
9   Manhattan           150     141.222222  82.308532   False
0   Manhattan           28      141.222222  82.308532   True
1   Manhattan           280     141.222222  82.308532   True

Also: note by @Willem Van Onsem the definition of an 'outlier' is usually 3 sigma above/below the mean. Consider this in your work and you can define your deviation from the mean (I used std=1)

Upvotes: 1

Peruz
Peruz

Reputation: 433

I think using groupby makes perfectly sense. I would then get the single groups, using get_group method for example. Finally you can do any analysis you need, see this example in case you missed it

Detect and exclude outliers in Pandas data frame

Cheers and good work, I'll follow the question as I'm interested too

Upvotes: 1

Erfan
Erfan

Reputation: 42946

You can use Groupby.apply and then get all the values which are outside the range of 3 * std while substracting each value with the mean:

m = df.groupby('neighbourhood_group')['price'].apply(lambda x: x.sub(x.mean()).abs() <= (x.std()*3))

df[m]

Output

  neighbourhood_group  price
0            Brooklyn    149
1           Manhattan    225
2           Manhattan    150
3            Brooklyn     89
4           Manhattan     80
5           Manhattan    200
6            Brooklyn     60
7           Manhattan     79
8           Manhattan     79
9           Manhattan    150

note: in this case we get all the rows back, since there are no outliers.

Upvotes: 2

Related Questions