Reputation: 557
I've been trying to remove outliers from my database using isolation forest, but I can't figure out how. I've seen the examples for credit card fraud and salary data, but I can't figure out how to apply them to each column, as my database consists of 3862900 rows and 19 columns. I've uploaded an image of the head of my database. I can't figure out how to apply isolation forest to each column and then permanently remove these outliers.
Thank you.
Upvotes: 3
Views: 3399
Reputation: 27
I know I'm late in answering this question, but as @Kenan said, Isolation Forest is used to identify outliers, not to remove them outright.
One thing you could do is use O_Sieve, which performs automatic outlier removal and gives you the cleaned dataset.
pip install vcosmos
from spatial_domain.anaomaly import O_Sieve
sieve = O_Sieve(your_df, target_column, tsf=2, bsf=2)
clean_df=sieve.filtered_data()
print(clean_df)
You can adjust the tsf and bsf params to choose how many points are treated as outliers. To read more, check the vcosmos documentation.
Upvotes: 0
Reputation: 1966
IsolationForest can be used to clean outliers from your data. As this answer says, in usual machine learning settings you would run it to clean your training dataset.
from sklearn.ensemble import IsolationForest
clf = IsolationForest(max_samples=100, random_state=4, contamination=.1)
# Identify outliers:
y_pred_train = clf.fit_predict(X_train)
# Remove outliers, where 1 represents inliers and -1 represents outliers:
X_train_cleaned = X_train[y_pred_train == 1]
In the unsupervised setting, we could use a different method such as IQR to choose a value for the contamination parameter.
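One way to read the IQR suggestion is sketched below: estimate the fraction of rows that fall outside the usual 1.5 × IQR fences and pass that fraction as contamination. The helper name iqr_outlier_fraction, the factor k=1.5, and the toy data are assumptions made for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

def iqr_outlier_fraction(df, k=1.5):
    """Fraction of rows with at least one value outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    # Comparing a DataFrame with a Series aligns the Series on the column labels.
    outside = (df < q1 - k * iqr) | (df > q3 + k * iqr)
    return outside.any(axis=1).mean()

# Toy data (illustrative only): 200 rows, the first 10 shifted far from the rest.
rng = np.random.default_rng(0)
X_train = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
X_train.iloc[:10] += 25

# contamination must lie in (0, 0.5], so clamp the estimate into that range.
contamination = float(np.clip(iqr_outlier_fraction(X_train), 0.001, 0.5))

clf = IsolationForest(max_samples=100, random_state=4, contamination=contamination)
y_pred_train = clf.fit_predict(X_train)       # 1 = inlier, -1 = outlier
X_train_cleaned = X_train[y_pred_train == 1]  # keep only the inlier rows
```

The IQR estimate only sets how many rows the forest flags; which rows are flagged is still decided by the forest's anomaly scores.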
Upvotes: 3
Reputation: 14094
According to the docs, IsolationForest is used for detecting outliers, not removing them.
import pandas as pd
from sklearn.ensemble import IsolationForest
df = pd.DataFrame({'temp': [1,2,3,345,6,7,5345, 8, 9, 10, 11]})
clf = IsolationForest().fit(df['temp'].values.reshape(-1, 1))
clf.predict([[4], [5], [3636]])
array([ 1, 1, -1])
As you can see from the output, 4 and 5 are not outliers but 3636 is.
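To see what lies behind those labels, decision_function returns the score that predict thresholds at zero. The sketch below reuses the same toy column; random_state=0 is an assumption added only to make repeated runs comparable.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.DataFrame({'temp': [1, 2, 3, 345, 6, 7, 5345, 8, 9, 10, 11]})
X = df[['temp']].to_numpy()

# random_state is fixed here only so that repeated runs give the same forest.
clf = IsolationForest(random_state=0).fit(X)

new_points = [[4], [5], [3636]]
labels = clf.predict(new_points)            # 1 = inlier, -1 = outlier
scores = clf.decision_function(new_points)  # negative score <=> label -1

for (value,), label, score in zip(new_points, labels, scores):
    print(value, label, round(float(score), 3))
```

Lower scores mean a point is isolated in fewer random splits, which is why a value far from the training data ends up on the negative side.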
If you want to remove outliers from your dataframe, you should use the IQR:
quant = df['temp'].quantile([0.25, 0.75])
df['temp'][~df['temp'].clip(*quant).isin(quant)]
4 6
5 7
7 8
8 9
9 10
As you can see, the outliers have been removed.
For the whole df:
def IQR(df, colname, bounds = [.25, .75]):
    s = df[colname]
    q = s.quantile(bounds)
    return df[~s.clip(*q).isin(q)]
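The helper above handles one column per call and keeps only rows strictly between the two quantiles, so with the default bounds each call drops roughly half the rows. A common variant of the IQR rule instead uses fences at Q1 - 1.5 × IQR and Q3 + 1.5 × IQR. Here is a minimal sketch that applies those fences to every numeric column at once; the function name remove_iqr_outliers, the factor k, and the extra label column are illustrative assumptions.

```python
import pandas as pd

def remove_iqr_outliers(df, k=1.5):
    """Drop rows where any numeric column lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    num = df.select_dtypes(include="number")
    q1 = num.quantile(0.25)
    q3 = num.quantile(0.75)
    iqr = q3 - q1
    # Comparisons with NaN are False, so rows with missing numeric values
    # are dropped as well; adjust if that is not what you want.
    within = (num >= q1 - k * iqr) & (num <= q3 + k * iqr)
    return df[within.all(axis=1)]

df = pd.DataFrame({
    "temp": [1, 2, 3, 345, 6, 7, 5345, 8, 9, 10, 11],
    "label": list("abcdefghijk"),  # non-numeric columns are kept but not checked
})
clean_df = remove_iqr_outliers(df)
print(clean_df)
```

On this toy frame the fences are -4.5 and 19.5, so only the rows holding 345 and 5345 are dropped. To make the removal permanent, assign the result back, for example df = remove_iqr_outliers(df).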
Note: Isolation forest does not remove outliers from your dataset by itself; it is used to detect new outliers.
Upvotes: 3