prashanth manohar

Reputation: 680

Removing rows that have outliers in a pandas DataFrame using the Z-score method

I am using this code to remove outliers.

import pandas as pd
import numpy as np
from scipy import stats

df = pd.DataFrame(np.random.randn(100, 3))
df[np.abs(stats.zscore(df[0])) < 1.5]

This works: the number of rows in the data frame is reduced. However, I need to remove outliers based on the percentage change values of a similar data frame.

df = df.pct_change()
df.plot.line(subplots=True)

df[np.abs(stats.zscore(df[0])) < 1.5]

This results in an empty data frame. What am I doing wrong? Should the threshold of 1.5 be adjusted? I tried several values, and none of them works.

Upvotes: 1

Views: 389

Answers (1)

Corralien

Reputation: 120509

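It's because the first value of your dataframe is NaN after pct_change, since the first row has no previous value to compare against. stats.zscore propagates NaN by default, so every z-score becomes NaN, every comparison with 1.5 is False, and the filter returns an empty data frame. Use fillna to replace the NaN value before computing the z-score.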

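import pandas as pd
import numpy as np
from scipy import stats  # needed for stats.zscore

np.random.seed(42)
df = pd.DataFrame(np.random.randn(100, 3))

# pct_change leaves NaN in the first row; replace it so zscore does not return all NaN
pct = df[0].pct_change().fillna(0)
# np.abs works whether zscore returns a NumPy array or a pandas Series
out = df[np.abs(stats.zscore(pct)) < 1.5]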

Output:

>>> out
           0         1         2
0   0.496714 -0.138264  0.647689
1   1.523030 -0.234153 -0.234137
2   1.579213  0.767435 -0.469474
3   0.542560 -0.463418 -0.465730
4   0.241962 -1.913280 -1.724918
..       ...       ...       ...
95 -1.952088 -0.151785  0.588317
96  0.280992 -0.622700 -0.208122
97 -0.493001 -0.589365  0.849602
98  0.357015 -0.692910  0.899600
99  0.307300  0.812862  0.629629

[92 rows x 3 columns]  # <- 8 rows have been removed
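
As an alternative, here is a minimal sketch that leaves the NaN in place instead of filling it with 0, so the artificial 0 does not enter the mean and standard deviation. It assumes you are happy to drop the first row, and it computes the z-score with plain pandas rather than scipy (pandas' mean and std skip NaN by default). The names z and mask and the use of default_rng are my own choices for illustration, so the numbers will differ from the output above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.standard_normal((100, 3)))

# pct_change() leaves NaN in the first row because there is no previous value.
pct = df[0].pct_change()

# pandas skips NaN in mean() and std(); ddof=0 matches the default of scipy.stats.zscore.
z = (pct - pct.mean()) / pct.std(ddof=0)

# A comparison with NaN is False, so the first row is dropped together with the outliers.
mask = z.abs() < 1.5
out = df[mask]

print(len(df), len(out))
```

If you prefer to stay with scipy, stats.zscore also accepts nan_policy='omit', which should give the same mask; I have not shown it here to keep the example dependency-light.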

Upvotes: 1
