Reputation: 53
I have the following function that removes outliers, but I want to replace them with the mean value of the same column instead.
def remove_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1  # Interquartile range
    fence_low = q1 - 1.5 * iqr
    fence_high = q3 + 1.5 * iqr
    df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
    return df_out
Upvotes: 1
Views: 1451
Reputation: 343
Nice function! However, when I pass arguments and run it, the following warning is raised at the line df_out.loc[outliers, col_name] = df_out.loc[~outliers, col_name].mean():
"FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas."
If I first store the mean in a new variable, ave, and then assign it to df_out.loc[outliers, col_name], it works.
def replace_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1  # Interquartile range
    fence_low = q1 - 1.5 * iqr
    fence_high = q3 + 1.5 * iqr
    df_out = df_in.copy()  # copy the argument, not a global df
    # inclusive='neither' replaces the boolean inclusive=False, which pandas 2.x rejects
    outliers = ~df_out[col_name].between(fence_low, fence_high, inclusive='neither')
    ave = df_out.loc[~outliers, col_name].mean()
    df_out.loc[outliers, col_name] = ave
    return df_out
My pandas version is 2.1.0.
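As far as I can tell, that warning shows up when the column has an integer dtype and the value being written is a non-integer float such as a mean. Here is a minimal sketch, assuming that is the cause, which casts the column to float before the assignment. The name replace_outlier_float and the sample data are hypothetical and only for illustration.

```python
import pandas as pd


def replace_outlier_float(df_in, col_name):
    """Hypothetical variant: cast the column to float, then fill outliers with the mean."""
    df_out = df_in.copy()
    # Casting first means the float mean is compatible with the column dtype
    df_out[col_name] = df_out[col_name].astype(float)
    q1 = df_out[col_name].quantile(0.25)
    q3 = df_out[col_name].quantile(0.75)
    iqr = q3 - q1  # Interquartile range
    fence_low = q1 - 1.5 * iqr
    fence_high = q3 + 1.5 * iqr
    outliers = ~df_out[col_name].between(fence_low, fence_high, inclusive="neither")
    df_out.loc[outliers, col_name] = df_out.loc[~outliers, col_name].mean()
    return df_out


# Made-up integer column with one obvious outlier (100)
df_int = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6, 7, 8, 10, 100]})
result = replace_outlier_float(df_int, "a")
print(result)
```

The input frame is left untouched. Only the returned copy has a float column with the outlier replaced.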
Upvotes: 0
Reputation: 734
Let's try this. Identify the outliers based on your criteria, then assign to them the mean of the column computed over the records that are not outliers.
With some test data:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': range(10), 'b': np.random.randn(10)})
# These will be our two outlier points
df.iloc[0] = -5
df.iloc[9] = 5
>>> df
   a         b
0 -5 -5.000000
1  1  1.375111
2  2 -1.004325
3  3 -1.326068
4  4  1.689807
5  5 -0.181405
6  6 -1.016909
7  7 -0.039639
8  8 -0.344721
9  5  5.000000
def replace_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1  # Interquartile range
    fence_low = q1 - 1.5 * iqr
    fence_high = q3 + 1.5 * iqr
    df_out = df_in.copy()  # copy the argument, not a global df
    # inclusive='neither' replaces the boolean inclusive=False, which pandas 2.x rejects
    outliers = ~df_out[col_name].between(fence_low, fence_high, inclusive='neither')
    df_out.loc[outliers, col_name] = df_out.loc[~outliers, col_name].mean()
    return df_out
>>> replace_outlier(df, 'b')
   a         b
0 -5 -0.106019
1  1  1.375111
2  2 -1.004325
3  3 -1.326068
4  4  1.689807
5  5 -0.181405
6  6 -1.016909
7  7 -0.039639
8  8 -0.344721
9  5 -0.106019
We can check that the fill value equals the mean of the remaining (non-outlier) values in the column:
>>> df.iloc[1:9]['b'].mean()
-0.10601866399896176
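The check above hard-codes rows 1 through 8. As a rough sketch, the same idea can be written so the outlier mask is computed once and reused for both the replacement and the check. The helper names iqr_outliers and replace_outlier_mask are hypothetical, and the sample values are the rounded ones printed above, so the mean differs from the original in the last few digits.

```python
import pandas as pd


def iqr_outliers(col):
    """Return a boolean mask that is True where a value lies on or outside the 1.5*IQR fences."""
    q1 = col.quantile(0.25)
    q3 = col.quantile(0.75)
    iqr = q3 - q1  # Interquartile range
    return ~col.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr, inclusive='neither')


def replace_outlier_mask(df_in, col_name):
    """Hypothetical variant: replace outliers with the mean of the non-outliers via Series.mask."""
    col = df_in[col_name].astype(float)
    outliers = iqr_outliers(col)
    fill = col[~outliers].mean()
    # assign returns a new DataFrame, so df_in is not modified
    return df_in.assign(**{col_name: col.mask(outliers, fill)})


# Column 'b' from the test data above, using the rounded values as printed
sample = pd.DataFrame({'b': [-5.0, 1.375111, -1.004325, -1.326068, 1.689807,
                             -0.181405, -1.016909, -0.039639, -0.344721, 5.0]})
mask = iqr_outliers(sample['b'])
fixed = replace_outlier_mask(sample, 'b')

# Every replaced value should equal the mean of the values that were kept
print((fixed.loc[mask, 'b'] == sample.loc[~mask, 'b'].mean()).all())
```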
Upvotes: 2