Reputation: 567
I have an example of a numerical column in a df
that has 10 other columns (both numerical and categorical):
Units
-12
4
4
5
1
5
12
6
34
6
7
12
745
I would like to apply the formulas:
Low outlier: q1-(1.5*iqr)
High outlier: q3+(1.5*iqr)
I know that there is iqr in scipy (from scipy.stats import iqr), and using numpy I can calculate q1 (first quartile) and q3 (third quartile) as follows:
from scipy.stats import iqr
import numpy as np
q1=np.percentile(df.Units, 25)
q3=np.percentile(df.Units, 75)
mask = df['Units'].between(q1, q3, inclusive=True)
iqr = df.loc[mask, 'Units']
However, two steps are missing from the calculation above:
NaN values: I would prefer not to remove them from my column, only to exclude them from the calculations;
Low outlier: q1-(1.5*iqr)
High outlier: q3+(1.5*iqr)
I think this could be done with between, or by simply filtering out values lower/higher than the bounds calculated from the formulas above.
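The filtering idea described above can be sketched as follows. This is only a minimal sketch, not a definitive solution: the helper name keep_within_iqr_fences and its parameter k are hypothetical, and the sample frame reuses the Units values with one NaN added for illustration. It relies on Series.quantile skipping NaN by default and on Series.between being inclusive by default.

```python
import numpy as np
import pandas as pd


def keep_within_iqr_fences(df, column, k=1.5):
    """Keep rows whose value lies within the IQR fences, plus rows where it is NaN.

    Hypothetical helper for illustration; `k` is the fence multiplier.
    """
    values = df[column]
    q1 = values.quantile(0.25)  # Series.quantile ignores NaN
    q3 = values.quantile(0.75)
    iqr_value = q3 - q1
    lower = q1 - k * iqr_value  # low outlier bound
    upper = q3 + k * iqr_value  # high outlier bound
    # between() is inclusive on both ends by default; NaN rows are kept explicitly
    mask = values.between(lower, upper) | values.isna()
    return df[mask]


# Sample data: the Units column from the question, with one NaN added (assumption)
df = pd.DataFrame(
    {"Units": [-12, 4, 4, 5, 1, 5, 12, 6, 34, 6, 7, np.nan, 12, 745]}
)
filtered = keep_within_iqr_fences(df, "Units")
```

The NaN rows stay in filtered because they are OR-ed into the mask; they never take part in the quartile calculation.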
Upvotes: 1
Views: 230
Reputation: 5294
A lazy attempt with df.query, which excludes NaNs from the calculations and keeps them in the final result:
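import numpy as np
import pandas as pd
from scipy.stats import iqr

df = pd.DataFrame([-12,4,4,5,1,5,12,6,34,6,7,np.nan, 12,745], columns=["Units"])
# nanpercentile and nan_policy="omit" ignore NaN when computing the quartiles and IQR
q1 = np.nanpercentile(df.Units, 25)
q3 = np.nanpercentile(df.Units, 75)
IQR = iqr(df.Units, nan_policy="omit")
# keep NaN rows plus rows within [q1 - 1.5*IQR, q3 + 1.5*IQR]
df.query("Units.isnull() or ((@q1 - 1.5*@IQR) <= Units <= (@q3 + 1.5*@IQR))")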
which gives the (hopefully) expected result:
Units
1 4.0
2 4.0
3 5.0
4 1.0
5 5.0
6 12.0
7 6.0
9 6.0
10 7.0
11 NaN
12 12.0
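To see why rows 0, 8 and 13 are dropped, it may help to compute the bounds explicitly. The snippet below is a small sketch using only numpy on the same values; the variable names (units, low_fence, high_fence, outliers) are chosen here for illustration. With NumPy's default linear interpolation, q1 is 4, q3 is 12 and the IQR is 8, so the bounds come out as -8 and 24.

```python
import numpy as np

# Same values as in the DataFrame above, including the NaN
units = np.array([-12, 4, 4, 5, 1, 5, 12, 6, 34, 6, 7, np.nan, 12, 745])

q1 = np.nanpercentile(units, 25)  # 4.0, NaN ignored
q3 = np.nanpercentile(units, 75)  # 12.0, NaN ignored
iqr_value = q3 - q1               # 8.0

low_fence = q1 - 1.5 * iqr_value   # -8.0
high_fence = q3 + 1.5 * iqr_value  # 24.0

# Comparisons against NaN are False, so the NaN is never flagged as an outlier
outliers = units[(units < low_fence) | (units > high_fence)]
print(low_fence, high_fence, outliers)
```

The three flagged values are -12, 34 and 745, which are exactly the rows missing from the output above.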
Upvotes: 1