ah bon
ah bon

Reputation: 10051

Groupby and remove upper outliers in Python

I'm working with the following data frame, how can I groupby city and drop only upper outliers in each column of num1 and num2, the example outliers in num1 such as 9473, 9450, 9432 for bj and 7200, 5600 for sh? Thanks.

      id address  num1   num2
0   1001      bj  9473      0
1   1002      bj  9450    189
2   1003      bj  9432   1574
3   1004      bj  4010   4802
4   1005      bj  3910  30747
5   1006      bj  3808  45373
6   1007      bj  3384  48846
7   1008      bj  3315  11377
8   1009      bj  2679  33207
9   1010      bj  2485  21483
10  1011      bj  2436  42125
11  1012      bj  2382  11034
12  1013      bj  1698  44503
13  1014      bj  1657  18383
14  1015      bj  1603  28072
15  1016      bj  1557   8427
16  1017      bj  1541  29578
17  1018      bj  1494  23471
18  1019      sh  7200  14691
19  1020      sh  5600  20321
20  1021      sh  1383   2152
21  1022      sh  1321  24152
22  1023      sh  1295   6770
23  1024      sh  1292   6173
24  1025      sh  1236  18965
25  1026      sh  1223  31745
26  1027      sh  1209      0
27  1028      sh  1196   4206
28  1029      sh  1182  14530
29  1030      sh  1165  15300
30  1031      sh  1162  22701
31  1032      sh  1143  36859
32  1033      sh  1130  23382
33  1034      sh  1129  29679
34  1035      sh  1117   3388
35  1036      sh  1066  39502
36  1037      sh  1066  24099
37  1038      sh  1036   3617

So far I have tried two methods:

Method 1:

First I define outlier_iqr function to detect and filter extreme high values:

def outliers_iqr(ys):
    Q1, Q3 = np.percentile(ys, [1, 99])
    iqr = Q3 - Q1
    # lower_bound = Q1 - (iqr * 1.5)  # not necessary since only detect upper outliers
    upper_bound = Q3 + (iqr * 1.5)
    # return np.where((ys > upper_bound) | (ys < lower_bound))
    return np.where((ys > upper_bound))

Then I need apply the function to columns num1 and num2 and write to excel:

df[['num1','num2']] = df[['num1','num2']].apply(outlier_iqr)

Method 2, which doesn't work yet:

Q1 = df1['num1'].quantile(0.01)
Q3 = df1['num1'].quantile(0.99)
IQR = Q3 - Q1

Q1 = df1['num2'].quantile(0.01)
Q3 = df1['num2'].quantile(0.99)
IQR = Q3 - Q1

df[df.groupby("address")['num1', 'num2'].transform(lambda x : x < (Q3 + 1.5*IQR)).eq(1)]

Thank you for your help.

Reference related:

Remove outliers in Pandas dataframe with groupby

Upvotes: 1

Views: 605

Answers (1)

jezrael
jezrael

Reputation: 863361

I believe you need:

f = lambda x : (x<x.quantile(0.99))
mask = df.groupby("address")['num1', 'num2'].transform(f).all(axis=1)

#test removed rows
df2 = df[~mask]
print (df)
      id address  num1   num2
0   1001      bj  9473      0
6   1007      bj  3384  48846
18  1019      sh  7200  14691
35  1036      sh  1066  39502

#remove rows
df1 = df[mask]

Upvotes: 3

Related Questions