Reputation:
I am trying to build an isolation forest using scikit-learn and Python to detect anomalies. I have attached an image of what the data may look like, and I am trying to predict 'pages' based on several 'size' features.
When I run print(anomaly), every single row is flagged as -1, an anomaly. Is this because I am only using 'size2' to classify them? Is there a way to use multiple columns to help detect the anomalies? Should I be setting n_features equal to the number of columns I am using? Thank you so much for your help.
from sklearn.ensemble import IsolationForest

model = IsolationForest(n_estimators=100, max_samples='auto', contamination='auto')
model.fit(df[['pages']])
df['size2'] = model.decision_function(df[['pages']])
df['anomaly'] = model.predict(df[['pages']])
print(df.head(50))
anomaly = df.loc[df['anomaly'] == -1]
anomaly_index = list(anomaly.index)
print(anomaly)
Upvotes: 0
Views: 3784
Reputation: 416
I'm not sure an isolation forest is appropriate here. If you want to predict 'pages' column values based on size data, you would be better off using either a regression model or a classifier (I can't tell whether 'pages' is categorical based on the data shown). With that said, if you do want to do anomaly detection, you have to make sure that you're fitting your model on the same subset of features you use for prediction. Detecting anomalies based on the size features looks something like this:
df['anomaly'] = model.fit_predict(df[['size2', 'size3', 'size4']])
Any subset of columns can be chosen to train the model on, but calls to both fit() and predict() must be made with the same feature set.
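As a rough illustration of fitting and predicting on the same set of columns, here is a minimal sketch. The DataFrame is synthetic stand-in data (the real data isn't available), the 'size3' and 'size4' column names are assumptions, and the single planted outlier in row 0 is only there so the result is easy to inspect.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the real data; column names are assumed.
rng = np.random.default_rng(0)
features = ["size2", "size3", "size4"]
df = pd.DataFrame(rng.normal(loc=100.0, scale=5.0, size=(200, 3)),
                  columns=features)

# Plant one obviously unusual row so there is something to find.
df.loc[0, features] = [500.0, 5.0, 900.0]

model = IsolationForest(n_estimators=100, max_samples="auto",
                        contamination="auto", random_state=0)

# Fit and predict on the SAME list of feature columns.
model.fit(df[features])
df["anomaly"] = model.predict(df[features])  # -1 = outlier, 1 = inlier

print(df[df["anomaly"] == -1])
```

There is no n_features argument to set here: the number of features is taken from the columns passed to fit(), and it is recorded afterwards in model.n_features_in_.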
In the code given, the model is trained on the 'label' column but used to predict outliers based on the 'pages' column. Although the 'label' column isn't shown, if the values in it are substantially different from those in the 'pages' column, it's not surprising that they would all be categorized as outliers. In addition, as written, the 'size2' column is not being used as a feature for prediction but is instead being overwritten by the decision function scores for the 'pages' column.
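To avoid clobbering a feature column, the scores can go into a new column instead. A small sketch, again on made-up data; the 'score' column name and the 'size3' column are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Made-up data; 'size3' is an assumed column name.
rng = np.random.default_rng(1)
features = ["size2", "size3"]
df = pd.DataFrame(rng.normal(size=(100, 2)), columns=features)
size2_before = df["size2"].copy()

model = IsolationForest(n_estimators=100, random_state=0).fit(df[features])

# Store the results in NEW columns so the features stay intact.
df["score"] = model.decision_function(df[features])  # lower = more abnormal
df["anomaly"] = model.predict(df[features])          # -1 where score < 0

print(df.sort_values("score").head())
```

Sorting by the score column is often more informative than looking only at the -1/1 labels, since it shows how close each row is to the cutoff.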
Upvotes: 1