Reputation: 125
Here is my code:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.ensemble import IsolationForest

data = pd.read_csv('marks1.csv', encoding='latin-1',
                   on_bad_lines='skip', index_col=0, header=0)

random_state = np.random.RandomState(42)
model = IsolationForest(n_estimators=100, max_samples='auto',
                        contamination=0.2, random_state=random_state)
model.fit(data[['Mark']])

# Score every row, then label it: 1 = inlier, -1 = anomaly
data['scores'] = model.decision_function(data[['Mark']])
data['anomaly_score'] = model.predict(data[['Mark']])
data[data['anomaly_score'] == -1].head()
Warning:
C:\Program Files\Python39\lib\site-packages\sklearn\base.py:450: UserWarning: X does not have valid feature names, but IsolationForest was fitted with feature names
  warnings.warn(
Upvotes: 10
Views: 17515
Reputation: 178
It depends on the version of sklearn you are using. From version 1.0 onward, estimators fitted on a DataFrame store the column names in a feature_names_in_ attribute and check later inputs against them. There was a bug in these versions that raised this warning when fitting with DataFrames: https://github.com/scikit-learn/scikit-learn/issues/21577
I'm not up to date with the new best practices for this yet, so I can't say definitively how it should be set up. For now I have simply sidestepped the issue in my code: I convert my DataFrames to NumPy arrays before training, so the model never records any feature names to compare against.
df.to_numpy()
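Applied to the code in the question, the workaround might look roughly like the sketch below. Note that the synthetic Mark column is only a hypothetical stand-in for marks1.csv, and the warning capture is there just to show that nothing about feature names is raised. Treat it as an illustration, not the recommended setup.

```python
import warnings

import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical stand-in for marks1.csv: a single numeric "Mark" column.
rng = np.random.RandomState(42)
data = pd.DataFrame({"Mark": rng.randint(0, 101, size=200)})

# Convert once, so fit, decision_function and predict all receive
# the same plain NumPy array (no column names attached).
X = data[["Mark"]].to_numpy()

model = IsolationForest(n_estimators=100, max_samples="auto",
                        contamination=0.2, random_state=42)

# Record any warnings raised while fitting and scoring.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    model.fit(X)
    data["scores"] = model.decision_function(X)
    data["anomaly_score"] = model.predict(X)  # 1 = inlier, -1 = anomaly

# Keep only warnings that mention feature names.
feature_name_warnings = [w for w in caught if "feature names" in str(w.message)]
print(len(feature_name_warnings))
print(data[data["anomaly_score"] == -1].head())
```

The point is to convert once and reuse the same array everywhere. Mixing a DataFrame in one call with an array in another is what triggers the feature-name check.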
Upvotes: 16