gnashergnawer

Reputation: 11

How to scale both X and y data in sklearn pipeline?

I am trying to scale both the X feature data and the y output data in my sklearn pipeline. My code is below; it uses a grid search with cross-validation to find the optimum number of latent variables (LVs).

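import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

kfold = KFold(n_splits=5, shuffle=False)  # 5-fold CV without shuffling
pipeline = Pipeline(steps=[('preprocessor', StandardScaler()), ('model', PLSRegression())])  # scales X only

param_grid = {'model__n_components': np.arange(1, 10)}  # candidate numbers of components
search = GridSearchCV(pipeline, param_grid, scoring='neg_mean_squared_error', cv=kfold, refit=True)  # 5-fold grid search, refitting the best model on the full training set
search.fit(Xtrain, Ytrain)  # search through the grid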

Upvotes: 1

Views: 2324

Answers (2)

racerX

Reputation: 1092

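In your pipeline, replace PLSRegression() with TransformedTargetRegressor(regressor=PLSRegression(), transformer=StandardScaler()). That folds the target scaling into the sklearn pipeline. Note that the grid-search key then becomes model__regressor__n_components, because PLSRegression is now nested inside the wrapper.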

Upvotes: 1

afsharov

Reputation: 5164

Pipeline objects are meant to apply a series of transformations to the features before feeding them to the final estimator along with the target values. As of now, you cannot transform the target values within such a pipeline.

At the moment, the canonical way to perform a transformation on the target for regression tasks is to use the TransformedTargetRegressor. From the documentation:

Useful for applying a non-linear transformation to the target y in regression problems.

You can also pass the pipeline you defined in your question to a TransformedTargetRegressor object and specify a transformer or function that should be applied to the targets y. Here is an example of how you would apply StandardScaler:

from sklearn.compose import TransformedTargetRegressor
from sklearn.cross_decomposition import PLSRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


pipeline = Pipeline(steps=[('preprocessor', StandardScaler()), ('model', PLSRegression())])  # scales X, then fits PLS
estimator = TransformedTargetRegressor(regressor=pipeline, transformer=StandardScaler())  # scales y around the whole pipeline

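You can then pass this estimator object to GridSearchCV to find the best hyperparameters. Because the pipeline is now nested under the regressor parameter, the grid key for the number of components becomes regressor__model__n_components.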

Upvotes: 2
