Reputation: 47
In given data frame I have these two columns:
neighbourhood_group
price
Price column contains all the prices for all neighbourhood_group:
neighbourhood_group price
0 Brooklyn 149
1 Manhattan 225
2 Manhattan 150
3 Brooklyn 89
4 Manhattan 80
5 Manhattan 200
6 Brooklyn 60
7 Manhattan 79
8 Manhattan 79
9 Manhattan 150
I am trying to detect outliers withing each neighbourhood_group.
The only idea I have come up to so far is to group by prices by neighbourhood_group, detect outliers within each group and create a mask for rows that needs to be dropped.
data.groupby('neighbourhood_group')['price']
I suspect there might be an easier solution for that.
Upvotes: 1
Views: 117
Reputation: 10903
I'll do it a bit manually:
let's assume your df is this (note I added 2 lines at the bottom)
neighbourhood_group price
0 Brooklyn 149
1 Manhattan 225
2 Manhattan 150
3 Brooklyn 89
4 Manhattan 80
5 Manhattan 200
6 Brooklyn 60
7 Manhattan 79
8 Manhattan 79
9 Manhattan 150
10 Manhattan 28
11 Manhattan 280
let's add 2 column to facilitate here:
df['mean']=df.groupby('neighbourhood_group').transform('mean')
df['std'] = df.groupby('neighbourhood_group')['price'].transform('std')
let's ask for true/false if is_outlier
df['is_outlier'] = df.apply(lambda x: x['price']+x['std']<x['mean'] or x['price']-x['std']>x['mean'], axis=1)
result
neighbourhood_group price mean std is_outlier
0 Brooklyn 149 99.333333 45.390895 True
1 Manhattan 225 141.222222 82.308532 True
2 Manhattan 150 141.222222 82.308532 False
3 Brooklyn 89 99.333333 45.390895 False
4 Manhattan 80 141.222222 82.308532 False
5 Manhattan 200 141.222222 82.308532 False
6 Brooklyn 60 99.333333 45.390895 False
7 Manhattan 79 141.222222 82.308532 False
8 Manhattan 79 141.222222 82.308532 False
9 Manhattan 150 141.222222 82.308532 False
0 Manhattan 28 141.222222 82.308532 True
1 Manhattan 280 141.222222 82.308532 True
Also: note by @Willem Van Onsem the definition of an 'outlier' is usually 3 sigma above/below the mean. Consider this in your work and you can define your deviation from the mean (I used std=1)
Upvotes: 1
Reputation: 433
I think using groupby makes perfectly sense. I would then get the single groups, using get_group method for example. Finally you can do any analysis you need, see this example in case you missed it
Detect and exclude outliers in Pandas data frame
Cheers and good work, I'll follow the question as I'm interested too
Upvotes: 1
Reputation: 42946
You can use Groupby.apply
and then get all the values which are outside the range of 3 * std
while substracting each value with the mean
:
m = df.groupby('neighbourhood_group')['price'].apply(lambda x: x.sub(x.mean()).abs() <= (x.std()*3))
df[m]
Output
neighbourhood_group price
0 Brooklyn 149
1 Manhattan 225
2 Manhattan 150
3 Brooklyn 89
4 Manhattan 80
5 Manhattan 200
6 Brooklyn 60
7 Manhattan 79
8 Manhattan 79
9 Manhattan 150
note: in this case we get all the rows back, since there are no outliers.
Upvotes: 2