Reputation: 2360
I have pandas
Dataframe, i want to eliminate extreme values for a column.
eg: I have pandas
data frame called df
, and have column called percentage
in it. I want to filter out the data frame based on the following condition, eliminate first 10 percentile and last 10 percentile based on values in percentage column.
I want to filter it to 10th to 90th percentile.
I thought of the following,
df[(df.percentage > np.percentile(df.percentage, 10 )) & (df.percentage < np.percentile(df.percentage, 90 ))]
Is it the right approach ? or can someone recommend faster approach if any ?
Upvotes: 2
Views: 4682
Reputation: 1106
The whole dataframe should be filtered on the percentiles of df['percentage']
?
import numpy as np
import pandas as pd
df = pd.DataFrame({'percentage': abs(np.random.normal(loc=50, scale=30, size=100)),
'var1': np.random.rand(100),
'var2': np.random.rand(100),
'var3': np.random.rand(100)})
# Find out percentiles
lower = np.percentile(df['percentage'], 10)
upper = np.percentile(df['percentage'], 90)
# Select data between
trimmed = df[df.percentage.between(lower, upper)]
Upvotes: 4
Reputation: 2495
Try .quantile
p_10 = df.percentage.quantile(0.1)
p_90 = df.percentage.quantile(0.9)
df[df.percentage.gt(p_10) & df.percentage.lt(p_90)]
Upvotes: 6