Reputation: 564

Remove outliers from training data

Assuming I have a pandas dataframe, I use the following to remove outliers:

import numpy as np
from scipy import stats

y = df['Label']
df = df.drop(['Label'], axis=1)
new_df = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]

Since I don't want to include the 'Label' column in the z-score calculation, how can I also remove the labels that belong to the outlier rows?

Thank you

Upvotes: 1

Answers (3)

Reputation: 11

With the autooptimizer module, you can easily remove outliers from your dataset. It uses the interquartile range (IQR) method to remove outliers.

pip install autooptimizer 

from autooptimizer.process import outlier_removal 

outlier_removal(data)
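
For reference, the interquartile range rule itself can be sketched in plain pandas. This is not autooptimizer's implementation: the helper name `remove_outliers_iqr`, the multiplier `k=1.5`, and the toy data are assumptions made for illustration.

```python
import pandas as pd

def remove_outliers_iqr(frame, k=1.5):
    """Keep rows whose numeric values all lie within [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Column-wise comparison: each bound is a Series aligned on column names.
    within = (numeric >= q1 - k * iqr) & (numeric <= q3 + k * iqr)
    return frame[within.all(axis=1)]

# Hypothetical toy data: the value 100 in column "x" is the only outlier.
data = pd.DataFrame({"x": [1, 2, 3, 4, 5, 100], "y": [10, 11, 12, 13, 14, 15]})
cleaned = remove_outliers_iqr(data)
```

Because the rows are filtered on the original frame, any non-numeric column (such as a label) is carried along and stays aligned with the remaining rows.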

Upvotes: 0

Reputation: 518

new_df keeps the original index of the rows that survived the filter, so you can use that index to match new_df with the Label column:

new_df.join(y)
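
A minimal sketch of why this works, assuming toy data in which row 19 is the only outlier (the column names `a` and `b` and the variable names `features`, `mask`, `new_y`, and `joined` are made up for the example). The same boolean mask can also be applied directly to `y`:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy data: row 19 has an extreme value in column "a".
df = pd.DataFrame({
    "a": [float(i) for i in range(19)] + [1000.0],
    "b": np.linspace(0.0, 1.0, 20),
    "Label": [0, 1] * 10,
})

y = df["Label"]
features = df.drop(columns=["Label"])

# True for rows where every feature is within 3 standard deviations.
mask = (np.abs(stats.zscore(features.to_numpy(), axis=0)) < 3).all(axis=1)

new_df = features[mask]   # filtered features, original index preserved
new_y = y[mask]           # option 1: apply the same mask to the labels
joined = new_df.join(y)   # option 2: left join on the surviving index
```

Both options drop the label of row 19. `join` performs a left join on the index by default, so only labels whose index is still present in `new_df` are kept.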

Upvotes: 1

Reputation: 13527

Just perform the z-score calculation on the columns with a numeric dtype. There is no need to drop the "Label" column beforehand, as long as it is not numeric itself.

new_df = df[(np.abs(stats.zscore(df.select_dtypes(include="number"))) < 3).all(axis=1)]
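
A small sketch of this approach, assuming the "Label" column holds strings (if it were numeric it would be included in the z-score calculation and would have to be excluded by name instead). It uses `select_dtypes(include="number")`, the documented selector for numeric columns; the data and the names `numeric` and `mask` are made up for the example.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy data with a string label; row 19 is an outlier in "a".
df = pd.DataFrame({
    "a": list(range(19)) + [1000],
    "b": np.linspace(0.0, 1.0, 20),
    "Label": ["cat", "dog"] * 10,
})

# Only the numeric columns take part in the z-score calculation.
numeric = df.select_dtypes(include="number")
mask = (np.abs(stats.zscore(numeric.to_numpy(), axis=0)) < 3).all(axis=1)

# Filtering the full frame keeps "Label" aligned with the remaining rows.
new_df = df[mask]
```

Since the mask is applied to the whole dataframe, the labels are filtered together with the features and no separate join is needed.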

Upvotes: 2