Annamarie

Reputation: 303

Removing known outliers from a pandas DataFrame

I want to remove a subset of rows (my known outliers) from a pandas DataFrame.

For example:

df = data[~(data.outlier1 == 1)]

But my DataFrame has multiple outlier flag columns.

Is there something like:

 df = data[~((data.outlier1 == 1) or (data.outlier2 == 1) or (data.outlier3 == 1))]

The idea is to drop all outliers (flagged in different columns) in one step.

Upvotes: 1

Views: 1160

Answers (2)

mgoldwasser

Reputation: 15394

Another approach is to winsorize the data instead of removing rows. In the example below, the lowest 5% and highest 5% of values are replaced by the nearest remaining value (roughly the 5th and 95th percentiles), so no rows are lost:

import pandas as pd
from scipy.stats import mstats
# In a Jupyter notebook, run the magic "%matplotlib inline" first so plots render inline.

test_data = pd.Series(range(30))
test_data.plot()

Original data

# Winsorize: replace the lowest 5% and highest 5% of values with the nearest remaining value
transformed_test_data = pd.Series(mstats.winsorize(test_data, limits=[0.05, 0.05]))
transformed_test_data.plot()

Winsorized data
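For readers without SciPy, a similar effect can be sketched in pandas alone by clipping at quantiles. This is a swapped-in technique, not the winsorize call above: it clips to interpolated quantile values, so the results may differ slightly from mstats.winsorize. The names lower, upper and clipped are made up for this illustration.

```python
import pandas as pd

# Same toy data as above, stored as floats so the clipped bounds fit the dtype.
test_data = pd.Series(range(30), dtype=float)

# Interpolated 5th and 95th percentiles of the series.
lower = test_data.quantile(0.05)
upper = test_data.quantile(0.95)

# Values below `lower` are raised to it, values above `upper` are lowered to it.
# No rows are dropped.
clipped = test_data.clip(lower=lower, upper=upper)
```

Because clip accepts scalars or aligned Series, the same idea should extend to a DataFrame by computing the quantiles per column and passing axis=1.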

Upvotes: 0

EdChum

Reputation: 393863

IIUC, you just need to use the bitwise OR operator | to combine multiple conditions:

df = data[~((data.outlier1 == 1) | (data.outlier2 == 1) | (data.outlier3 == 1))]

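The reason is that each comparison of a column with a scalar returns a boolean Series, not a single bool. Python's or asks for the truth value of a whole Series, which pandas rejects with "ValueError: The truth value of a Series is ambiguous", whereas | combines the Series element-wise. The parentheses around each comparison are required because | binds more tightly than ==.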
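A minimal runnable sketch of the filter, assuming the three flag columns hold 0/1 values as in the question. The sample data and the value column are made up for illustration. The second form, which checks all flag columns at once with eq and any, is an alternative not mentioned in the question and scales better when there are many flag columns.

```python
import pandas as pd

# Hypothetical data: three 0/1 flag columns named as in the question.
data = pd.DataFrame({
    "value":    [10, 11, 250, 12, -90, 13],
    "outlier1": [0, 0, 1, 0, 0, 0],
    "outlier2": [0, 0, 0, 0, 1, 0],
    "outlier3": [0, 1, 0, 0, 0, 0],
})

# Element-wise OR of the three boolean Series, then negate to keep non-outliers.
mask = (data.outlier1 == 1) | (data.outlier2 == 1) | (data.outlier3 == 1)
df = data[~mask]

# Alternative: a row is an outlier if any of the listed flag columns equals 1.
flag_cols = ["outlier1", "outlier2", "outlier3"]
df_alt = data[~data[flag_cols].eq(1).any(axis=1)]
```

Both forms keep the original index labels of the surviving rows; call reset_index(drop=True) afterwards if a fresh 0-based index is wanted.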

Upvotes: 2
