rmarion37

Reputation: 73

How to make a GridSearchCV with a proper FunctionTransformer in a pipeline?

I'm trying to build a Pipeline with GridSearchCV to filter data (with IsolationForest) and perform a regression with StandardScaler+MLPRegressor.

I made a FunctionTransformer to include my iForest filter in the pipeline. I also defined a parameter grid for the iForest filter (using the kw_args argument).

All seems OK, but when I run the fit, nothing happens ... No error message. Nothing.

Afterwards, when I try to predict, I get the message: "This RandomizedSearchCV instance is not fitted yet"

from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.ensemble import IsolationForest
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint as sp_randint, uniform as sp_rand

#Definition of the function auto_filter using the iForest algo
def auto_filter(DF, conta=0.1):
    #iForest made on the DF dataframe
    iforest = IsolationForest(behaviour='new', n_estimators=300, max_samples='auto', contamination=conta)
    iforest = iforest.fit(DF)

    # The DF (dataframe in input) is filtered, keeping only the inlier observations
    data_filtered = DF[iforest.predict(DF) == 1]

    # Only a few variables are kept for the next step (regression by MLPRegressor);
    # this function delivers X_filtered and y
    X_filtered = data_filtered[['SessionTotalTime','AverageHR','MaxHR','MinHR','EETotal','EECH','EEFat','TRIMP','BeatByBeatRMSSD','BeatByBeatSD','HFAverage','LFAverage','LFHFRatio','Weight']]
    y = data_filtered['MaxVO2']
    return (X_filtered, y)

#Pipeline definition ('auto_filter' --> 'scaler' --> 'MLPRegressor')    
pipeline_steps = [('auto_filter', FunctionTransformer(auto_filter)),
                  ('scaler', StandardScaler()),
                  ('MLPR', MLPRegressor(solver='lbfgs', activation='relu', early_stopping=True,
                                        n_iter_no_change=20, validation_fraction=0.2, max_iter=10000))]

#Grid search definition with different values of 'conta' for the first stage of the pipeline ('auto_filter')
parameters = {'auto_filter__kw_args': [{'conta': 0.1}, {'conta': 0.2}, {'conta': 0.3}],
              'MLPR__hidden_layer_sizes': [(sp_randint.rvs(1, nb_features, 1),),
                                           (sp_randint.rvs(1, nb_features, 1), sp_randint.rvs(1, nb_features, 1))],
              'MLPR__alpha': sp_rand.rvs(0, 1, 1)}

pipeline = Pipeline(pipeline_steps)

estimator = RandomizedSearchCV(pipeline, parameters, cv=5, n_iter=10)
estimator.fit(X_train, y_train)

Upvotes: 1

Views: 1013

Answers (2)

Sanjar Adilov

Reputation: 1099

The func parameter of FunctionTransformer should be a callable that accepts the same arguments as the transform method (an array-like X of shape (n_samples, n_features), plus any kwargs passed via kw_args) and returns a transformed X of the same shape. Your function auto_filter doesn't meet these requirements.
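For illustration, here is a minimal sketch of a func that does satisfy this contract (a hypothetical log-scaling transform, not part of the original code):

>>> import numpy as np
>>> from sklearn.preprocessing import FunctionTransformer
>>> def log_scale(X, offset=1.0):
...     # returns an array of the same shape as X, as transform requires
...     return np.log(X + offset)
>>> transformer = FunctionTransformer(log_scale, kw_args={'offset': 1.0})
>>> transformer.fit_transform(np.array([[0.0, 1.0], [2.0, 3.0]]))
array([[0.        , 0.69314718],
       [1.09861229, 1.38629436]])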

Additionally, anomaly/outlier detection techniques from scikit-learn cannot be used as intermediate steps in scikit-learn pipelines, since a pipeline assembles one or more transformers and an optional final estimator. IsolationForest or, say, OneClassSVM is not a transformer: it implements fit and predict. Thus, a possible solution is to cut off possible outliers separately and build a pipeline composed of transformers and a regressor:

>>> import warnings
>>> from sklearn.exceptions import ConvergenceWarning
>>> warnings.filterwarnings(category=ConvergenceWarning, action='ignore')
>>> import numpy as np
>>> from scipy import stats
>>> from sklearn.datasets import make_regression
>>> from sklearn.ensemble import IsolationForest
>>> from sklearn.model_selection import RandomizedSearchCV
>>> from sklearn.neural_network import MLPRegressor
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> X, y = make_regression(n_samples=50, n_features=2, n_informative=2)
>>> detect = IsolationForest(contamination=0.1, behaviour='new')
>>> inliers_mask = detect.fit_predict(X) == 1
>>> pipe = Pipeline([('scale', StandardScaler()),
...                  ('estimate', MLPRegressor(max_iter=500, tol=1e-5))])
>>> param_distributions = dict(estimate__alpha=stats.uniform(0, 0.1))
>>> search = RandomizedSearchCV(pipe, param_distributions,
...                             n_iter=2, cv=3, iid=True)
>>> search = search.fit(X[inliers_mask], y[inliers_mask])

The problem is that you won't be able to optimize the hyperparameters of IsolationForest this way. One way to handle it is to define a hyperparameter space for the forest, sample hyperparameters with ParameterSampler or ParameterGrid, predict inliers, and fit the randomized search:

>>> from sklearn.model_selection import ParameterGrid
>>> forest_param_dict = dict(contamination=[0.1, 0.15, 0.2])
>>> forest_param_grid = ParameterGrid(forest_param_dict)
>>> for sample in forest_param_grid:
...     detect = detect.set_params(contamination=sample['contamination'])
...     inliers_mask = detect.fit_predict(X) == 1
...     search.fit(X[inliers_mask], y[inliers_mask])
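Note that each search.fit call in this loop discards the results of the previous one. If you also want to recover the best contamination value overall, one option (a sketch, not part of the original answer) is to track the scores yourself:

>>> best_score, best_combo = None, None
>>> for sample in forest_param_grid:
...     detect = detect.set_params(contamination=sample['contamination'])
...     inliers_mask = detect.fit_predict(X) == 1
...     search = search.fit(X[inliers_mask], y[inliers_mask])
...     # keep the best (contamination, pipeline hyperparameters) pair seen so far
...     if best_score is None or search.best_score_ > best_score:
...         best_score = search.best_score_
...         best_combo = {**sample, **search.best_params_}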

Upvotes: 0

Danylo Baibak

Reputation: 2316

You can try running the steps manually, one by one, to find the problem:

auto_filter_transformer = FunctionTransformer(auto_filter)
X_train = auto_filter_transformer.fit_transform(X_train)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)

MLPR = MLPRegressor(solver='lbfgs', activation='relu', early_stopping=True, n_iter_no_change=20, validation_fraction=0.2, max_iter=10000)
MLPR.fit(X_train, y_train)

If each step works fine, build the pipeline and check it. If the pipeline works too, try RandomizedSearchCV.
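For this particular pipeline, the first step alone should already expose the bug (a sketch, assuming the auto_filter from the question): depending on the scikit-learn version, fit_transform either raises during input validation or returns the (X_filtered, y) tuple instead of an array, and either outcome pinpoints the broken step.

auto_filter_transformer = FunctionTransformer(auto_filter)
out = auto_filter_transformer.fit_transform(X_train)
# if this line is reached, out is a tuple rather than an array of shape
# (n_samples, n_features), so the StandardScaler step cannot consume it
print(type(out))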

Upvotes: 1
