AliY

Reputation: 557

Outlier removal Isolation Forest

I've been trying to remove outliers from my database using Isolation Forest, but I can't figure out how. I've seen the credit card fraud and salary examples, but I can't work out how to apply them to each column, as my database consists of 3862900 rows and 19 columns. I've uploaded an image of the head of my database. How can I apply Isolation Forest to each column and then permanently remove the outliers?

Thank you.

Upvotes: 3

Views: 3399

Answers (3)

ihatecoding

Reputation: 27

I know I'm late in answering this question, but as @Kenan said, Isolation Forest is used to identify outliers, not to remove them outright.

One thing you could do is use O_Sieve, which performs automatic outlier removal and returns the cleaned dataset.

pip install vcosmos
from spatial_domain.anaomaly import O_Sieve
sieve = O_Sieve(your_df, target_column, tsf=2, bsf=2)
clean_df=sieve.filtered_data()
print(clean_df)

You can adjust the tsf and bsf params to choose how aggressively outliers are removed. To read more, check the vcosmos documentation.

Upvotes: 0

Mario

Reputation: 1966

IsolationForest can be used to clean outliers from your data. As this answer says, in usual machine learning settings you would run it to clean your training dataset.

from sklearn.ensemble import IsolationForest

clf = IsolationForest(max_samples=100, random_state=4, contamination=.1)
# Identify outliers: fit_predict returns 1 for inliers and -1 for outliers
y_pred_train = clf.fit_predict(X_train)
# Keep only the inliers (works for a NumPy array or a pandas DataFrame)
X_train_cleaned = X_train[y_pred_train == 1]

In the unsupervised setting, we could use a different method such as IQR to choose a value for the contamination parameter.
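
One possible reading of that suggestion, as a rough sketch rather than an established recipe: count the share of rows that fall outside the usual 1.5 * IQR fences in at least one column, and pass that share as contamination. The helper name iqr_outlier_fraction, the multiplier k=1.5, the clamping bounds, and the synthetic X_train are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

def iqr_outlier_fraction(df, k=1.5):
    """Share of rows lying outside the k * IQR fences in at least one column."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    outside = (df < q1 - k * iqr) | (df > q3 + k * iqr)
    return outside.any(axis=1).mean()

# Synthetic stand-in for the real training data (assumption for the demo)
rng = np.random.default_rng(0)
X_train = pd.DataFrame(rng.normal(size=(500, 3)), columns=["a", "b", "c"])
X_train.iloc[:10] = X_train.iloc[:10] + 20  # inject a few obvious outliers

# contamination must lie in (0, 0.5], so clamp the estimate
frac = iqr_outlier_fraction(X_train)
contamination = float(min(max(frac, 1e-3), 0.5))

clf = IsolationForest(random_state=4, contamination=contamination)
mask = clf.fit_predict(X_train) == 1  # True for inliers
X_train_cleaned = X_train[mask]
```

Because the fences are computed per column, the estimate grows with the number of columns, so it is worth checking the resulting value before trusting it on a wide table.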

Upvotes: 3

Kenan

Reputation: 14094

According to the docs, Isolation Forest is used for detecting outliers, not removing them.

import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.DataFrame({'temp': [1,2,3,345,6,7,5345, 8, 9, 10, 11]})
clf = IsolationForest().fit(df['temp'].values.reshape(-1, 1))
clf.predict([[4], [5], [3636]])

array([ 1, 1, -1])

As you can see from the output, 4 and 5 are not outliers, but 3636 is.

If you want to remove outliers from your dataframe you should use the IQR

quant = df['temp'].quantile([0.25, 0.75])
df['temp'][~df['temp'].clip(*quant).isin(quant)]
4     6
5     7
7     8
8     9
9    10

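As you can see, the outliers have been removed. Note that this keeps only the values strictly between the two quantiles, so 1, 2, 3 and 11 are dropped as well.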

For the whole df, filtering rows by one column:

def IQR(df, colname, bounds = [.25, .75]):
    s = df[colname]
    q = s.quantile(bounds)
    return df[~s.clip(*q).isin(q)]
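
The helper above filters on a single column and keeps only the rows between the two quantiles. For a table with many columns, a common variant is the 1.5 * IQR fence rule applied to every numeric column at once. The sketch below is one way to do that; the function name drop_iqr_outliers and the multiplier k=1.5 are assumptions, not part of the original answer.

```python
import pandas as pd

def drop_iqr_outliers(df, k=1.5):
    """Drop rows with a value outside [Q1 - k*IQR, Q3 + k*IQR] in any numeric column."""
    num = df.select_dtypes(include="number")
    q1 = num.quantile(0.25)
    q3 = num.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align on column names; NaN compares as False,
    # so rows with missing numeric values are dropped too
    inside = (num >= q1 - k * iqr) & (num <= q3 + k * iqr)
    return df[inside.all(axis=1)]

df = pd.DataFrame({'temp': [1, 2, 3, 345, 6, 7, 5345, 8, 9, 10, 11]})
clean = drop_iqr_outliers(df)
print(clean['temp'].tolist())  # [1, 2, 3, 6, 7, 8, 9, 10, 11]
```

Here Q1 is 4.5 and Q3 is 10.5, so the fences are -4.5 and 19.5, and only 345 and 5345 are removed. Assigning the result back (df = drop_iqr_outliers(df)) makes the removal permanent.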

Note: Isolation Forest does not remove outliers from your dataset by itself; it is used to detect them, including in new data.

Upvotes: 3
