Isolation Forest Evaluation

Question

This is my first time running anomaly detection with Isolation Forest (sklearn). I have a dataframe of continuous variables mixed with categorical variables, which I one-hot encoded. I then created a pivot table (a sum aggregate of all my fields for each day), ran the model, and added the anomaly scores and labels to my table. Is there a way to see which feature(s) contribute to the anomaly label for any given anomaly?

Also, any advice on optimization would be appreciated.

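import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import OneHotEncoder

# cols = list of fields for my model (approximately 150)
ohe = OneHotEncoder()
idf = ohe.fit_transform(df[cols]).toarray()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
idf = pd.DataFrame(idf, index=df.index, columns=ohe.get_feature_names_out())

x = pd.concat([df.drop(columns=cols), idf], axis=1)
# Sum every remaining field per day (with values omitted, all non-index columns are aggregated)
x = pd.pivot_table(x, index='date', aggfunc='sum')

clf = IsolationForest(n_estimators=100, max_samples='auto', contamination=0.12, max_features=1.0, bootstrap=False, n_jobs=-1, random_state=42, verbose=0)
clf.fit(x)
# Compute both outputs from the feature columns before adding anything to x
scores = clf.decision_function(x)
labels = clf.predict(x)
x['score'] = scores
x['anomaly'] = labels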
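
One rough way to probe which features drive a given anomaly, sketched here under assumptions rather than as a built-in sklearn feature: reset each feature of the flagged row to its column median, one at a time, and measure how much decision_function rises. The helper name feature_contributions and the toy data below are hypothetical and only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest


def feature_contributions(model, features, row_label):
    """Score gain when each feature of one row is reset to its column median.

    A large positive gain means the row looks much more normal once that
    feature is neutralized, so the feature likely contributes to the anomaly.
    """
    row = features.loc[[row_label]]            # keep it as a one-row DataFrame
    base = model.decision_function(row)[0]     # lower score = more anomalous
    medians = features.median()
    gains = {}
    for col in features.columns:
        probe = row.copy()
        probe[col] = medians[col]              # neutralize a single feature
        gains[col] = model.decision_function(probe)[0] - base
    return pd.Series(gains).sort_values(ascending=False)


# Toy data (assumption): four Gaussian features, with one extreme value planted in "c"
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
demo.loc[0, "c"] = 15.0

model = IsolationForest(n_estimators=100, random_state=42).fit(demo)
contrib = feature_contributions(model, demo, 0)
print(contrib)
```

This perturbs one feature at a time, so it can miss anomalies that only arise from combinations of features, and with roughly 150 one-hot columns it calls decision_function once per column for each row inspected.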

