Reputation: 695
This is my first time running anomaly detection with Isolation Forest (sklearn). I have a dataframe of continuous and categorical variables; I one-hot encoded the categorical ones, then created a pivot table (sum of every field for each day). I ran the model and added the anomaly scores and labels to my table. Is there a way to see which feature(s) are contributing to the anomaly label for any given anomaly?
Any advice on optimization would also be appreciated.
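import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import OneHotEncoder

cols = [...]  # placeholder: list of fields for my model (approximately 150)

# One-hot encode the selected fields and keep readable column names.
# get_feature_names_out() needs scikit-learn >= 1.0; older versions used get_feature_names().
ohe = OneHotEncoder()
idf = ohe.fit_transform(df[cols]).toarray()
idf = pd.DataFrame(idf, index=df.index, columns=ohe.get_feature_names_out(cols))
x = pd.concat([df.drop(columns=cols), idf], axis=1)

# Sum every remaining field per day; omitting `values` aggregates all non-index columns.
x = pd.pivot_table(x, index='date', aggfunc='sum')

clf = IsolationForest(n_estimators=100, max_samples='auto', contamination=0.12, max_features=1.0, bootstrap=False, n_jobs=-1, random_state=42, verbose=0)
clf.fit(x)

# Both arguments are evaluated on the feature-only table before the new columns are added.
x = x.assign(score=clf.decision_function(x), anomaly=clf.predict(x))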
Upvotes: 0
Views: 492
Reputation: 2042
To understand the contribution of each feature, I recommend using the SHAP library.
Here you will find how to implement it in Python. Basically, the library tells you which features, and which values of those features, are contributing to your anomaly detection algorithm's output.
Upvotes: 1