Reputation: 63
I'm facing an issue when replacing outliers with the upper and lower bounds from the interquartile range (IQR) rule: the kernel returns an error saying "Must specify axis=0 or 1".
Here is the function I defined to replace outliers with the upper and lower bounds:
def iqr(df):
    for col in df.columns:
        if df[col].dtype != object:
            Q1 = df[col].quantile(0.25)
            Q3 = df.quantile(0.75)
            IQR = Q3 - Q1
            S = 1.5*IQR
            LB = Q1 - S
            UB = Q3 + S
            df[df > UB] = UB
            df[df < LB] = LB
        else:
            break
    return df
The DataFrame is the Boston housing dataset, which can be loaded from scikit-learn:
import pandas as pd
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2, so this needs an older version

boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df
Then I use the function to replace outliers in the numerical attributes with the upper or lower bound:
iqr(df)
But it then fails with a ValueError:
ValueError: Must specify axis=0 or 1
Looking for help, thank you!
Upvotes: 1
Views: 1946
Reputation: 46968
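Inside the loop over columns, you should always use df[col], not df, since you are working with only one column at a time. Calling quantile on the whole DataFrame returns one value per column (a Series) rather than a single number. So, for example, in your code: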
Q3 = df.quantile(0.75)
should be
Q3 = df[col].quantile(0.75)
And
df[df > UB] = UB
should be
df.loc[df[col] > UB, col] = UB
And so on.
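To make the difference between the two masks concrete, here is a minimal sketch on made-up toy data (the column names and values are assumptions for illustration, not the Boston data). It shows that df[col] > UB is a one-dimensional row mask, while df > UB is a mask over the whole frame, and that the .loc form touches only the chosen column.

```python
import pandas as pd

# Toy data, assumed for illustration only.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0],
    "b": [10.0, 20.0, 30.0, 40.0, 50.0],
})
col = "a"

# Per-column quantiles are plain scalars.
Q1 = df[col].quantile(0.25)
Q3 = df[col].quantile(0.75)
UB = Q3 + 1.5 * (Q3 - Q1)

mask_1d = df[col] > UB   # boolean Series: one entry per row
mask_2d = df > UB        # boolean DataFrame: one entry per cell

# Replace only the outliers in the chosen column.
df.loc[mask_1d, col] = UB
print(df)
```

Column "b" is left untouched here, even though some of its values exceed the bound computed for "a".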
Without changing your function too much, this works:
def iqr(df):
    for col in df.columns:
        if df[col].dtype != object:
            Q1 = df[col].quantile(0.25)
            Q3 = df[col].quantile(0.75)
            IQR = Q3 - Q1
            S = 1.5*IQR
            LB = Q1 - S
            UB = Q3 + S
            df.loc[df[col] > UB, col] = UB
            df.loc[df[col] < LB, col] = LB
        else:
            break
    return df
Consider writing the function for just one column and using apply:
def iqr(x):
    x = x.copy()  # work on a copy of the column
    Q1, Q3 = x.quantile([0.25, 0.75])
    S = 1.5 * (Q3 - Q1)
    x[x < Q1 - S] = Q1 - S
    x[x > Q3 + S] = Q3 + S
    return x

num_cols = df.select_dtypes('number').columns
df[num_cols] = df[num_cols].apply(iqr)
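As an alternative to masking, the same capping can probably be expressed with pandas' clip. This is only a sketch under assumptions: the helper name clip_iqr, the parameter k, and the toy frame are made up for illustration, and it returns a copy instead of modifying df in place.

```python
import pandas as pd

def clip_iqr(df, k=1.5):
    """Return a copy of df with numeric columns clipped to [Q1 - k*IQR, Q3 + k*IQR]."""
    out = df.copy()
    num_cols = out.select_dtypes("number").columns
    # On a DataFrame, quantile returns a Series with one value per column.
    q1 = out[num_cols].quantile(0.25)
    q3 = out[num_cols].quantile(0.75)
    spread = k * (q3 - q1)
    # axis=1 aligns the per-column bounds with the column labels.
    out[num_cols] = out[num_cols].clip(lower=q1 - spread, upper=q3 + spread, axis=1)
    return out

# Toy data, assumed for illustration only.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 100.0], "label": list("vwxyz")})
clipped = clip_iqr(toy)
print(clipped)
```

Selecting the numeric columns up front also means a non-numeric column in the middle of the frame does not stop the remaining columns from being processed.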
Upvotes: 1
Reputation: 667
To help debug this code, after you load df you could set col and then run individual lines of code from inside your iqr function.
import pandas as pd
# Make some toy data. Could also load boston dataset.
df = pd.DataFrame(dict(a=[-10, 100], b=[-100, 25]))
df
# Get the name of the first data column.
col = df.columns[0]
col
# Check if Q1 calculation works.
Q1 = df[col].quantile(0.25)
Q1
...
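In the same spirit, one check that may be worth running on the toy data is the type of each quantile result. The sketch below compares the per-column call with the whole-frame call from the question; the variable names q3_column and q3_frame are made up for illustration.

```python
import pandas as pd

# Same toy data as above.
df = pd.DataFrame(dict(a=[-10, 100], b=[-100, 25]))
col = df.columns[0]

q3_column = df[col].quantile(0.75)  # quantile of a single column
q3_frame = df.quantile(0.75)        # quantile of the whole DataFrame

print(type(q3_column))  # a scalar float type
print(type(q3_frame))   # a pandas Series, indexed by column name
print(q3_frame)
```

If a bound turns out to be a Series where a single number was expected, that line is a likely place for the comparison and assignment to go wrong.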
Upvotes: 0