V_sqrt

Reputation: 567

Outliers formula for columns in pandas

Here is an example of a numerical column in a df that has 10 other columns (both numerical and categorical):

Units
-12
4
4
5
1
5
12
6
34
6
7
12
745

I would like to apply the formulas:

Low outlier: q1-(1.5*iqr)
High outlier: q3+(1.5*iqr)

I know that scipy provides iqr (from scipy.stats import iqr), and that with numpy I can calculate q1 (first quartile) and q3 (third quartile) as follows:

from scipy.stats import iqr
import numpy as np

q1=np.percentile(df.Units, 25)
q3=np.percentile(df.Units, 75)

mask = df['Units'].between(q1, q3, inclusive=True)
iqr = df.loc[mask, 'Units']

However, two steps are still missing, namely computing the two bounds from the formulas above:

Low outlier: q1-(1.5*iqr)

High outlier: q3+(1.5*iqr)

I think this could be done with between, or simply by filtering out the values that are lower or higher than the bounds calculated from the formulas above.

Upvotes: 1

Views: 230

Answers (1)

filippo

Reputation: 5294

A lazy attempt with df.query. It excludes NaNs from the calculations and keeps them in the final result:

import numpy as np
import pandas as pd
from scipy.stats import iqr

df = pd.DataFrame([-12, 4, 4, 5, 1, 5, 12, 6, 34, 6, 7, np.nan, 12, 745], columns=["Units"])

# Quartiles and IQR, ignoring NaNs
q1 = np.nanpercentile(df.Units, 25)
q3 = np.nanpercentile(df.Units, 75)
IQR = iqr(df.Units, nan_policy="omit")

# Keep NaN rows and rows inside [q1 - 1.5*IQR, q3 + 1.5*IQR]
df.query("Units.isnull() or ((@q1 - 1.5*@IQR) <= Units <= (@q3 + 1.5*@IQR))")

which gives the (hopefully) expected result:

    Units
1     4.0
2     4.0
3     5.0
4     1.0
5     5.0
6    12.0
7     6.0
9     6.0
10    7.0
11    NaN
12   12.0
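
For comparison, here is a minimal pandas-only sketch of the same filter using between, as suggested in the question. The helper name iqr_bounds and its parameter k are made up for illustration and are not part of pandas or scipy. It assumes Series.quantile with its default linear interpolation, which skips NaNs:

```python
import numpy as np
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Hypothetical helper: return (low, high) outlier bounds for a numeric Series."""
    q1 = series.quantile(0.25)  # NaNs are skipped by default
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


df = pd.DataFrame(
    [-12, 4, 4, 5, 1, 5, 12, 6, 34, 6, 7, np.nan, 12, 745], columns=["Units"]
)

low, high = iqr_bounds(df["Units"])  # -8.0 and 24.0 for this data

# between() is inclusive on both ends by default; NaN rows are kept explicitly
keep = df["Units"].between(low, high) | df["Units"].isna()
filtered = df[keep]    # same rows as the query-based result
outliers = df[~keep]   # the rows with -12, 34 and 745
```

For this sample, q1 = 4 and q3 = 12, so the IQR is 8 and the bounds are 4 - 12 = -8 and 12 + 12 = 24. To apply it to several numerical columns, the helper could be called once per column, for example over df.select_dtypes("number").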

Upvotes: 1
