Reputation: 10051
I'm working with the following data frame, how can I groupby city
and drop only upper outliers in each column of num1
and num2
, the example outliers in num1
such as 9473, 9450, 9432
for bj
and 7200, 5600
for sh
? Thanks.
id address num1 num2
0 1001 bj 9473 0
1 1002 bj 9450 189
2 1003 bj 9432 1574
3 1004 bj 4010 4802
4 1005 bj 3910 30747
5 1006 bj 3808 45373
6 1007 bj 3384 48846
7 1008 bj 3315 11377
8 1009 bj 2679 33207
9 1010 bj 2485 21483
10 1011 bj 2436 42125
11 1012 bj 2382 11034
12 1013 bj 1698 44503
13 1014 bj 1657 18383
14 1015 bj 1603 28072
15 1016 bj 1557 8427
16 1017 bj 1541 29578
17 1018 bj 1494 23471
18 1019 sh 7200 14691
19 1020 sh 5600 20321
20 1021 sh 1383 2152
21 1022 sh 1321 24152
22 1023 sh 1295 6770
23 1024 sh 1292 6173
24 1025 sh 1236 18965
25 1026 sh 1223 31745
26 1027 sh 1209 0
27 1028 sh 1196 4206
28 1029 sh 1182 14530
29 1030 sh 1165 15300
30 1031 sh 1162 22701
31 1032 sh 1143 36859
32 1033 sh 1130 23382
33 1034 sh 1129 29679
34 1035 sh 1117 3388
35 1036 sh 1066 39502
36 1037 sh 1066 24099
37 1038 sh 1036 3617
So far I have tried two methods:
Method 1:
First I define outlier_iqr
function to detect and filter extreme high values:
def outliers_iqr(ys):
Q1, Q3 = np.percentile(ys, [1, 99])
iqr = Q3 - Q1
# lower_bound = Q1 - (iqr * 1.5) # not necessary since only detect upper outliers
upper_bound = Q3 + (iqr * 1.5)
# return np.where((ys > upper_bound) | (ys < lower_bound))
return np.where((ys > upper_bound))
Then I need apply the function to columns num1
and num2
and write to excel:
df[['num1','num2']] = df[['num1','num2']].apply(outlier_iqr)
Method 2, which doesn't work yet:
Q1 = df1['num1'].quantile(0.01)
Q3 = df1['num1'].quantile(0.99)
IQR = Q3 - Q1
Q1 = df1['num2'].quantile(0.01)
Q3 = df1['num2'].quantile(0.99)
IQR = Q3 - Q1
df[df.groupby("address")['num1', 'num2'].transform(lambda x : x < (Q3 + 1.5*IQR)).eq(1)]
Thank you for your help.
Reference related:
Remove outliers in Pandas dataframe with groupby
Upvotes: 1
Views: 605
Reputation: 863361
I believe you need:
f = lambda x : (x<x.quantile(0.99))
mask = df.groupby("address")['num1', 'num2'].transform(f).all(axis=1)
#test removed rows
df2 = df[~mask]
print (df)
id address num1 num2
0 1001 bj 9473 0
6 1007 bj 3384 48846
18 1019 sh 7200 14691
35 1036 sh 1066 39502
#remove rows
df1 = df[mask]
Upvotes: 3