Cranjis
Cranjis

Reputation: 1960

pandas dataframe how to replace extreme outliers for all columns

I have a pandas dataframe with some very extreme value - more than 5 std. I want to replace, per column, each value that is more than 5 std with the max other value. For example,

df = A B
     1 2
     1 6
     2 8
     1 115
     191 1

Will become:

df = A B
     1 2
     1 6
     2 8
     1 8
     2 1

What is the best way to do it without a for loop over the columns?

Upvotes: 1

Views: 871

Answers (3)

wwnde
wwnde

Reputation: 26676

s=df.mask((df-df.apply(lambda x: x.std() )).gt(5))#mask where condition applies
s=s.assign(A=s.A.fillna(s.A.max()),B=s.B.fillna(s.B.max())).sort_index(axis = 0)#fill with max per column and resort frame

    A    B
0  1.0  2.0
1  1.0  6.0
2  2.0  8.0
3  1.0  8.0
4  2.0  1.0

Upvotes: 1

Grinjero
Grinjero

Reputation: 451

Calculate a column-wise z-score (if you deem something an outlier if it lies outside a given number of standard deviations of the column) and then calculate a boolean mask of values outside your desired range

def calc_zscore(col):
   return (col - col.mean()) / col.std()

zscores = df.apply(calc_zscore, axis=0)
outlier_mask = zscores > 5

After that it's up to you to fill the values marked with the boolean mask.

df[outlier_mask] = something

Upvotes: 0

piterbarg
piterbarg

Reputation: 8219

Per the discussion in the comments you need to decide what your threshold is. say it is q=100, then you can do

q = 100
df.loc[df['A'] > q,'A'] = max(df.loc[df['A'] < q,'A'] )
df

this fixes column A:


    A   B
0   1   2
1   1   6
2   2   8
3   1   115
4   2   1

do the same for B

Upvotes: 0

Related Questions