Reputation: 1
I am trying to build an outlier detector to find outliers in test data. That data varies a bit (more test channels, longer testing).
First I apply a train/test split, because I want to use grid search on the training data to find the best parameters. This is time-series data from multiple sensors, and I removed the time column beforehand.
X shape : (25433, 17)
y shape : (25433, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                     test_size=0.33,
                                                     random_state=0)
Afterwards I standardize, and then I convert the arrays to integers, because GridSearch doesn't seem to like continuous data. This surely can be done better, but I want this to work before I optimize the code.
# X
mean = StandardScaler().fit(X_train)
X_train = mean.transform(X_train)
X_test = mean.transform(X_test)
X_train = (np.round(X_train, 2) * 100).astype(int)
X_test = (np.round(X_test, 2) * 100).astype(int)

# y
yeah = StandardScaler().fit(y_train)
y_train = yeah.transform(y_train)
y_test = yeah.transform(y_test)
y_train = (np.round(y_train, 2) * 100).astype(int)
y_test = (np.round(y_test, 2) * 100).astype(int)
I chose the IsolationForest because it's fast, gives pretty good results, and can handle huge data sets (I currently only use a chunk of the data for testing). SVM might also be an option I want to check out. Then I set up the GridSearchCV:
clf = IForest(random_state=47, behaviour='new', n_jobs=-1)

param_grid = {'n_estimators': [20, 40, 70, 100],
              'max_samples': [10, 20, 40, 60],
              'contamination': [0.1, 0.01, 0.001],
              'max_features': [5, 15, 30],
              'bootstrap': [True, False]}

fbeta = make_scorer(fbeta_score,
                    average='micro',
                    needs_proba=True,
                    beta=1)

grid_estimator = model_selection.GridSearchCV(clf,
                                              param_grid,
                                              scoring=fbeta,
                                              cv=5,
                                              n_jobs=-1,
                                              return_train_score=True,
                                              error_score='raise',
                                              verbose=3)
grid_estimator.fit(X_train, y_train)
The Problem:
GridSearchCV needs a y argument, so I think this only works with supervised learning? If I run this, I get the following error that I don't understand:
ValueError: Classification metrics can't handle a mix of multiclass and continuous-multioutput targets
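For reference, the error can be reproduced directly with fbeta_score. My assumption (not verified) is that needs_proba=True makes the scorer pass the 2-D predict_proba output to the metric, while y_train contains many distinct integer codes after scaling, so the metric sees "multiclass" versus "continuous-multioutput" targets:

import numpy as np
from sklearn.metrics import fbeta_score

y_true = np.array([0, 137, -42])           # int-coded targets, as above
y_proba = np.array([[0.9, 0.1],
                    [0.2, 0.8],
                    [0.5, 0.5]])            # shape of a predict_proba output

fbeta_score(y_true, y_proba, beta=1, average='micro')
# ValueError: Classification metrics can't handle a mix of multiclass
# and continuous-multioutput targets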
Upvotes: 0
Views: 1286
Reputation: 398
I agree with @Ben Reiniger's answer, and it has good links to other SO posts on this topic.
You can try creating a custom scorer, assuming you can make use of y_train (so this is not strictly unsupervised).
Here is one example where R2 score is used as a scoring metric.
from sklearn.metrics import r2_score

def scorer_f(estimator, X_train, Y_train):
    y_pred = estimator.predict(X_train)
    return r2_score(Y_train, y_pred)
Then you can use it as normal.
clf = IForest(random_state=47, behaviour='new', n_jobs=-1)

param_grid = {'n_estimators': [20, 40, 70, 100],
              'max_samples': [10, 20, 40, 60],
              'contamination': [0.1, 0.01, 0.001],
              'max_features': [5, 15, 30],
              'bootstrap': [True, False]}

grid_estimator = model_selection.GridSearchCV(clf,
                                              param_grid,
                                              scoring=scorer_f,
                                              cv=5,
                                              n_jobs=-1,
                                              return_train_score=True,
                                              error_score='raise',
                                              verbose=3)
grid_estimator.fit(X_train, y_train)
Upvotes: 0
Reputation: 12592
You can use GridSearchCV for unsupervised learning, but it's often tricky to define a scoring metric that makes sense for the problem.
Here's an example in the docs that uses grid search for KernelDensity, an unsupervised estimator. It works without issue because this estimator has a score method (docs).
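A minimal sketch of that docs pattern (the toy data and parameter grid here are my own, for illustration): because KernelDensity exposes a score method returning the total log-likelihood, GridSearchCV can rank candidates without any y or custom scorer.

import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV

X = np.random.RandomState(0).normal(size=(500, 2))   # toy data

params = {'bandwidth': np.logspace(-1, 1, 10)}
grid = GridSearchCV(KernelDensity(), params, cv=5)   # no scoring argument needed
grid.fit(X)                                          # no y needed either
print(grid.best_params_)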
In your case, since IsolationForest doesn't have a score method, you'll need to define a custom scorer to pass as the search's scoring parameter. There's an answer at this question, and also this question, but I don't think the metrics given there necessarily make sense. Unfortunately, I don't have a useful outlier detection metric in mind; that's a question better suited for the data science or statistics stackexchange sites.
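To illustrate just the plumbing (not a recommendation of the metric), here is a sketch using sklearn's IsolationForest with a callable scorer built on score_samples; the mean anomaly score is only a placeholder criterion, and whether it makes sense for model selection is exactly the caveat above.

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import GridSearchCV

def mean_anomaly_scorer(estimator, X, y=None):
    # Higher (less negative) score_samples values mean "more normal" points.
    return float(np.mean(estimator.score_samples(X)))

X = np.random.RandomState(0).normal(size=(300, 5))   # toy data

param_grid = {'n_estimators': [50, 100], 'max_samples': [64, 128]}
search = GridSearchCV(IsolationForest(random_state=0),
                      param_grid,
                      scoring=mean_anomaly_scorer,
                      cv=3)
search.fit(X)   # y is optional here because the scorer ignores it
print(search.best_params_)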
Upvotes: 2