user13676216

Isolation forest, how to use multiple features to predict, getting all anomalies

I am trying to build an isolation forest with scikit-learn in Python to detect anomalies. I have attached an image of what the data may look like, and I am trying to predict 'pages' based on several 'size' features. When I print(anomaly), every single row is flagged as -1, an anomaly. Is this because I am only using 'size2' to classify them? Is there a way to use multiple columns to help detect the anomalies? Should I set n_features equal to the number of columns I am using? Thank you so much for your help.

from sklearn.ensemble import IsolationForest

model = IsolationForest(n_estimators=100, max_samples='auto', contamination='auto')
model.fit(df[['pages']])
df['size2'] = model.decision_function(df[['pages']])
df['anomaly'] = model.predict(df[['pages']])
print(df.head(50))
anomaly = df.loc[df['anomaly'] == -1]
anomaly_index = list(anomaly.index)
print(anomaly)

Upvotes: 0

Views: 3784

Answers (1)

Charles Gleason

Reputation: 416

I'm not sure an isolation forest is appropriate here. If you want to predict the values of the pages column from the size data, you would be better off using either a regression model or a classifier (I can't tell from the data shown whether pages is categorical). That said, if you do want to do anomaly detection, you have to make sure you fit the model on the same subset of features you use for prediction. Detecting anomalies based on the size features would look something like this:

df['anomaly'] = model.fit_predict(df[['size2', 'size3', 'size4']])

Any subset of columns can be chosen to train the model on, but calls to both fit and predict must be made with the same feature set.
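To make that concrete, here is a minimal sketch of fitting and predicting on several columns at once. The DataFrame is synthetic (the values and the planted outlier in row 0 are made up for illustration, not taken from the question's data), and the `score` column name is an assumption. The point is that a single `features` list is reused for `fit`, `decision_function`, and `predict`, and that results go into new columns rather than over an input feature.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the question's data; all values are made up.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "size2": rng.normal(100, 5, 200),
    "size3": rng.normal(50, 2, 200),
    "size4": rng.normal(10, 1, 200),
})
# Plant one row that is extreme in all three features.
df.loc[0, ["size2", "size3", "size4"]] = [300.0, 150.0, 40.0]

# One list of feature columns, reused for every call on the model.
features = ["size2", "size3", "size4"]

model = IsolationForest(n_estimators=100, max_samples="auto",
                        contamination="auto", random_state=0)
model.fit(df[features])

# Write results to new columns so no input feature is overwritten.
df["score"] = model.decision_function(df[features])  # lower = more anomalous
df["anomaly"] = model.predict(df[features])          # -1 = anomaly, 1 = inlier

anomalies = df.loc[df["anomaly"] == -1]
print(anomalies)
```

The number of features is inferred from the columns passed to `fit`, so there is nothing extra to set on the model when you add or remove columns from `features`.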

In the code given, the model is trained on the label column but used to predict outliers based on the pages column. Although the label column isn't shown, if its values are substantially different from those in the pages column, it's not surprising that they would all be categorized as outliers. In addition, as written, the size2 column is not being used as a feature for prediction; it is being overwritten with the decision function scores for the pages column.
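The effect of a fit/predict mismatch can be reproduced with a small sketch. The two single-column frames below are hypothetical and deliberately on very different scales; they are not the question's data. Every value in `other` lies far outside the range the model saw during training, so each one should be isolated quickly and flagged.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical data: the model is trained on values near 0 ...
train = pd.DataFrame({"x": rng.normal(0, 1, 200)})
# ... and then asked to score values near 1000.
other = pd.DataFrame({"x": rng.normal(1000, 1, 200)})

model = IsolationForest(random_state=0).fit(train)

same = model.predict(train)        # same data the model was fitted on
mismatched = model.predict(other)  # data from a different distribution

# Fraction of rows flagged as anomalies (-1) in each case.
print("fitted data:    ", (same == -1).mean())
print("mismatched data:", (mismatched == -1).mean())
```

If every row comes back as -1, comparing the columns passed to `fit` with those passed to `predict` is a reasonable first check.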

Upvotes: 1
