zack

Reputation: 53

Replace outlier with mean value

I have the following function, which removes outliers from a column. Instead of dropping those rows, I want to replace the outliers with the mean value of the same column.

def remove_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1  # Interquartile range
    fence_low = q1 - 1.5 * iqr
    fence_high = q3 + 1.5 * iqr
    df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
    return df_out

Upvotes: 1

Views: 1451

Answers (2)

Ihmon

Reputation: 343

Nice function! However, when I call it, the following warning is raised at the line df_out.loc[outliers, col_name] = df_out.loc[~outliers, col_name].mean():

"FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas."

Storing the mean in a new variable, ave, and then assigning ave to df_out.loc[outliers, col_name] made it work for me.

def replace_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1  # Interquartile range
    fence_low = q1 - 1.5 * iqr
    fence_high = q3 + 1.5 * iqr
    df_out = df_in.copy()  # Work on a copy so df_in is left unchanged
    # True where the value lies on or outside the fences
    outliers = ~df_out[col_name].between(fence_low, fence_high, inclusive="neither")
    ave = df_out.loc[~outliers, col_name].mean()  # Mean of the non-outliers
    df_out.loc[outliers, col_name] = ave
    return df_out

My pandas version is 2.1.0.
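
A possible explanation, assuming the column holds integers: the mean is a float, and recent pandas versions warn when a float is written into an integer column. A minimal sketch of one way around that is to cast the column to float before assigning. The name replace_outlier_float and the sample data are hypothetical, and inclusive="neither" assumes pandas 1.3 or later.

```python
import pandas as pd

def replace_outlier_float(df_in, col_name):
    """Replace IQR outliers in col_name with the mean of the non-outliers."""
    df_out = df_in.copy()
    # Cast first so a float mean can be stored in the column.
    df_out[col_name] = df_out[col_name].astype(float)
    q1 = df_out[col_name].quantile(0.25)
    q3 = df_out[col_name].quantile(0.75)
    iqr = q3 - q1  # Interquartile range
    fence_low = q1 - 1.5 * iqr
    fence_high = q3 + 1.5 * iqr
    # True where the value lies on or outside the fences
    outliers = ~df_out[col_name].between(fence_low, fence_high, inclusive="neither")
    df_out.loc[outliers, col_name] = df_out.loc[~outliers, col_name].mean()
    return df_out

# Hypothetical integer column with one obvious outlier
example = pd.DataFrame({"x": [1, 2, 3, 4, 100]})
result = replace_outlier_float(example, "x")
print(result)
```

Here the fences are -1 and 7, so only 100 is replaced, by the mean of the remaining values (2.5). The input frame keeps its integer column because the function works on a copy.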

Upvotes: 0

mayosten

Reputation: 734

Let's try this. Identify the outliers based on your criteria, then assign to them the mean of the records in that column that are not outliers.

With some test data:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': range(10), 'b': np.random.randn(10)})

# These will be our two outlier points
df.iloc[0] = -5
df.iloc[9] = 5

>>> df
   a         b
0 -5 -5.000000
1  1  1.375111
2  2 -1.004325
3  3 -1.326068
4  4  1.689807
5  5 -0.181405
6  6 -1.016909
7  7 -0.039639
8  8 -0.344721
9  5  5.000000

def replace_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1  # Interquartile range
    fence_low = q1 - 1.5 * iqr
    fence_high = q3 + 1.5 * iqr
    df_out = df_in.copy()  # Work on a copy so df_in is left unchanged
    # True where the value lies on or outside the fences
    outliers = ~df_out[col_name].between(fence_low, fence_high, inclusive="neither")
    # Fill the outliers with the mean of the non-outliers
    df_out.loc[outliers, col_name] = df_out.loc[~outliers, col_name].mean()
    return df_out

>>> replace_outlier(df, 'b')

   a         b
0 -5 -0.106019
1  1  1.375111
2  2 -1.004325
3  3 -1.326068
4  4  1.689807
5  5 -0.181405
6  6 -1.016909
7  7 -0.039639
8  8 -0.344721
9  5 -0.106019

We can check that the fill value equals the mean of the non-outlier values in the column:

>>> df.iloc[1:9]['b'].mean()
-0.10601866399896176
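
As an alternative sketch (not part of the function above), the same replacement can be written on a single Series with Series.where, which keeps values where the condition is True and substitutes a fill value elsewhere. The helper name replace_outlier_where and the sample values are hypothetical, and inclusive="neither" assumes pandas 1.3 or later.

```python
import pandas as pd

def replace_outlier_where(series):
    """Return a copy of series with IQR outliers replaced by the inlier mean."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1  # Interquartile range
    # True where the value lies strictly inside the fences
    inside = series.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr, inclusive="neither")
    # Keep inliers, replace everything else with the mean of the inliers
    return series.where(inside, series[inside].mean())

# Hypothetical data: -5.0 and 5.0 fall outside the fences at -3.0 and 3.0
values = pd.Series([-5.0, 1.0, -1.0, 0.5, -0.5, 0.0, 5.0])
cleaned = replace_outlier_where(values)
print(cleaned.tolist())  # [0.0, 1.0, -1.0, 0.5, -0.5, 0.0, 0.0]
```

Because the helper takes a Series, it could also be applied to several numeric columns at once, for example with df[['a', 'b']].apply(replace_outlier_where).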

Upvotes: 2
