arooki

Reputation: 1

Can GridSearchCV be used for unsupervised learning?

I am trying to build an outlier detector to find outliers in test data. That data varies a bit (more test channels, longer testing).

First I'm applying a train/test split, because I wanted to use grid search on the training data to get the best results. This is time-series data from multiple sensors, and I removed the time column beforehand.

X shape : (25433, 17)
y shape : (25433, 1)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.33,
                                                    random_state=0)

I standardize afterwards and then convert everything into an int array, because GridSearch doesn't seem to like continuous data. This can surely be done better, but I want this to work before I optimize the code.

import numpy as np
from sklearn.preprocessing import StandardScaler

# X
scaler_X = StandardScaler().fit(X_train)
X_train = scaler_X.transform(X_train)
X_test = scaler_X.transform(X_test)

X_train = (np.round(X_train, 2) * 100).astype(int)
X_test = (np.round(X_test, 2) * 100).astype(int)

# y
scaler_y = StandardScaler().fit(y_train)
y_train = scaler_y.transform(y_train)
y_test = scaler_y.transform(y_test)

y_train = (np.round(y_train, 2) * 100).astype(int)
y_test = (np.round(y_test, 2) * 100).astype(int)

I chose the Isolation Forest because it's fast, gives pretty good results, and can handle huge data sets (I currently only use a chunk of the data for testing). An SVM might also be an option I want to check out. Then I set up the GridSearchCV:

from pyod.models.iforest import IForest
from sklearn import model_selection
from sklearn.metrics import fbeta_score, make_scorer

clf = IForest(random_state=47, behaviour='new',
              n_jobs=-1)

param_grid = {'n_estimators': [20,40,70,100], 
              'max_samples': [10,20,40,60], 
              'contamination': [0.1, 0.01, 0.001], 
              'max_features': [5,15,30], 
              'bootstrap': [True, False]}

fbeta = make_scorer(fbeta_score,
                    average = 'micro',
                    needs_proba=True,
                    beta=1)

grid_estimator = model_selection.GridSearchCV(clf, 
                                              param_grid,
                                              scoring=fbeta,
                                              cv=5,
                                              n_jobs=-1,
                                              return_train_score=True,
                                              error_score='raise',
                                              verbose=3)

grid_estimator.fit(X_train, y_train)

The Problem:

GridSearchCV needs a y argument, so I think this only works with supervised learning? If I run this, I get the following error, which I don't understand:

ValueError: Classification metrics can't handle a mix of multiclass and continuous-multioutput targets

Upvotes: 0

Views: 1286

Answers (2)

Gaurav Chawla

Reputation: 398

I agree with @Ben Reiniger's answer, and it has good links to other SO posts on this topic.
You can try creating a custom scorer, assuming you can make use of y_train. This is not strictly unsupervised.

Here is one example where the R² score is used as the scoring metric.

from sklearn.metrics import r2_score

def scorer_f(estimator, X_train, y_train):
    y_pred = estimator.predict(X_train)
    return r2_score(y_train, y_pred)

Then you can use it as normal.

clf = IForest(random_state=47, behaviour='new',
              n_jobs=-1)

param_grid = {'n_estimators': [20,40,70,100], 
              'max_samples': [10,20,40,60], 
              'contamination': [0.1, 0.01, 0.001], 
              'max_features': [5,15,30], 
              'bootstrap': [True, False]}

grid_estimator = model_selection.GridSearchCV(clf, 
                                              param_grid,
                                              scoring=scorer_f,
                                              cv=5,
                                              n_jobs=-1,
                                              return_train_score=True,
                                              error_score='raise',
                                              verbose=3)

grid_estimator.fit(X_train, y_train)

Upvotes: 0

Ben Reiniger

Reputation: 12592

You can use GridSearchCV for unsupervised learning, but it's often tricky to define a scoring metric that makes sense for the problem.

Here's an example in the docs that uses grid search for KernelDensity, an unsupervised estimator. It works without issue because this estimator has a score method (docs).
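For illustration, here is a minimal sketch along the lines of that docs example (the toy data and bandwidth grid below are just placeholders): because KernelDensity has a score method returning the total log-likelihood of the held-out data, GridSearchCV can rank the candidates without any y and without a custom scorer.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

# toy unlabelled data, only to make the snippet runnable
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))

# KernelDensity.score (total log-likelihood) is used automatically,
# so no `scoring` argument and no y are needed
params = {'bandwidth': np.logspace(-1, 1, 10)}
grid = GridSearchCV(KernelDensity(), params, cv=5)
grid.fit(X)

print(grid.best_params_)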

In your case, since IsolationForest doesn't have a score method, you'll need to define a custom scorer to pass as the search's scoring parameter. There's an answer at this question, and also this question, but I don't think the metrics given there necessarily make sense. Unfortunately, I don't have a useful outlier detection metric in mind; that's a question better suited for the data science or statistics stackexchange sites.
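That said, the mechanics themselves are straightforward: GridSearchCV accepts any callable with the signature (estimator, X, y) as its scoring argument. Here is a minimal sketch using scikit-learn's IsolationForest, with the mean of score_samples purely as a placeholder score to show the plumbing, not as a recommended metric (see the caveat above):

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import GridSearchCV

def mean_score_samples(estimator, X, y=None):
    # higher score_samples = "more normal"; used here only to show the plumbing
    return estimator.score_samples(X).mean()

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 17))   # stand-in for your scaled X_train

param_grid = {'n_estimators': [50, 100, 200],
              'max_samples': [64, 256, 'auto']}

search = GridSearchCV(IsolationForest(random_state=0),
                      param_grid,
                      scoring=mean_score_samples,
                      cv=5)
search.fit(X)   # no y needed; the scorer above ignores it
print(search.best_params_)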

Upvotes: 2
