Reputation: 1

How to replace outliers with NaN while keeping row intact using pandas in python?

I am working with a very large file and need to eliminate different outliers for each column.

I have been able to find outliers and replace them with NaN, however it is turning the whole row into NaN. I'm sure that I'm missing somthing simple but I can't seem to find it.

import pandas as pd
import numpy as np
pd.set_option('display.max_rows', 100000)   
pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 1000)

df = pd.read_excel('example sheet.xlsx')   

df = df.replace(df.loc[df['column 2']<=0] ,np.nan)
print(df)

How can I convert only the one value into NaN and not the whole row?

Thanks

Upvotes: 0

Answers (3)

VnC

Reputation: 2016

You can do something like the following:

df.mask(df <= 0, np.nan, axis=1)

No need to iterate over columns.

However, I would suggest you to use proper statistics in order to define the outliers, instead of <= 0.

You can use quantiles like:

df.mask(((df < df.quantile(0.05)) or (df > df.quantile(0.95))), np.nan, axis=1)

Upvotes: 1

Sociopath

Reputation: 13401

Use np.where for replacing the value based on condition.

# if you have to perform only for single column
df['column 2'] = np.where(df['column 2']<=0, np.nan, df['column 2'])


# if you want to apply on all/multiple columns.
for col in df.columns:
    df[col] = np.where(df[col]<=0, np.nan, df[col])

Upvotes: 0

Yong Wang

Reputation: 1313

In order to change certain cell with NAN, you should change the series value. instead of dataframe replace, you should use series repalce.

The wrong way:

df = df.replace(df.loc[df['column 2']<=0] ,np.nan)

One of right way:

for col in df.columns:
    s = df[col]
    outlier_s = s<=0
    df[col] = s.where(~outlier_s,np.nan)

where function: Replace values where the condition is False.

http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.where.html?highlight=where#pandas.DataFrame.where

Upvotes: 1

How to replace outliers with NaN while keeping row intact using pandas in python?

Answers (3)

where function: Replace values where the condition is False.

Related Questions