Mattia Paterna

Reputation: 1356

sklearn: apply same scaling to train and predict in a pipeline

I am writing a function that selects the best model using k-fold cross-validation. Inside the function, I have a pipeline that

  1. scales the data
  2. searches for the optimal parameters for a decision tree regressor

Then I want to use the model to predict some target values. To do so, I have to apply the same scaling that has been applied during the grid search.

Does the pipeline transform the data for which I want to predict the target using the same fit as for the training data, even though I do not specify it explicitly? I have been looking through the documentation and it seems that it does, but I am not sure, since this is the first time I have used pipelines.

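import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor

def build_model(data, target, param_grid):
    # compute the range of each feature (data is expected to be a pandas DataFrame)
    features = data.keys()
    feature_range = dict()
    maxs = data.max(axis=0)
    mins = data.min(axis=0)
    for feature in features:
        # compare strings by value (!=), not by identity (is not)
        if feature != 'metric':
            feature_range[feature] = {'max': maxs[feature], 'min': mins[feature]}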

    # initialise the k-fold cross validator
    no_split = 10
    kf = KFold(n_splits=no_split, shuffle=True, random_state=42)
    # create the pipeline
    pipe = make_pipeline(MinMaxScaler(), 
                         GridSearchCV(
                             estimator=DecisionTreeRegressor(), 
                             param_grid=param_grid, 
                             n_jobs=-1, 
                             cv=kf, 
                             refit=True))
    pipe.fit(data, target)

    return pipe, feature_range

max_depth = np.arange(1,10)
min_samples_split = np.arange(2,10)
min_samples_leaf = np.arange(2,10) 
param_grid = {'max_depth': max_depth, 
              'min_samples_split': min_samples_split, 
              'min_samples_leaf': min_samples_leaf}
pipe, feature_range = build_model(data=data, target=target, param_grid=param_grid)

# could that be correct?
pipe.predict(test_data)

EDIT: I found in the documentation for the preprocessing module that each preprocessing tool has an API that

computes the [transformation] on a training set so as to be able to reapply the same transformation on the testing set

If that is the case, the fitted transformation is presumably stored internally, so the answer may be yes.
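
A minimal sketch of what "stored internally" could look like for MinMaxScaler, using toy arrays invented purely for illustration (the names train, test and scaled_test are not from my real code). The scaler keeps the training minimum and maximum as attributes after fit, and transform reuses them:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data, invented purely for illustration.
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[20.0]])

scaler = MinMaxScaler()
scaler.fit(train)  # learns the min and max from the training set only

# The learned statistics are kept as attributes on the fitted object.
train_min = scaler.data_min_
train_max = scaler.data_max_

# transform() reuses the training min/max, so a test value outside the
# training range maps outside [0, 1] instead of being rescaled to fit.
scaled_test = scaler.transform(test)
print(train_min, train_max, scaled_test)
```

With these toy values, the test point 20 is scaled against the training range of 0 to 10, which suggests the transformation is not recomputed on new data.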

Upvotes: 4

Views: 3780

Answers (1)

amanbirs

Reputation: 1108

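When you call fit, the sklearn pipeline calls fit_transform (or fit followed by transform, if no fit_transform method exists) on every step except the last one. So in your pipeline the scaler is fitted on the training data and transforms it before it reaches GridSearchCV. When you call predict, the pipeline calls only transform on those same steps, reusing the parameters learned during fit, and then calls predict on the last step. The test data is therefore scaled with the training data's minimum and maximum.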

Documentation here.
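
One way to check this yourself is to compare the pipeline's prediction with a manual "transform, then predict". The following is a rough sketch under stated assumptions: the data is random and invented for illustration, and a plain DecisionTreeRegressor stands in for your GridSearchCV step to keep it short. The step names come from make_pipeline, which names each step after its lowercased class name.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor

# Random toy data, invented purely for illustration.
rng = np.random.RandomState(0)
X_train = rng.uniform(0, 100, size=(50, 3))
y_train = X_train[:, 0] + rng.normal(size=50)
X_test = rng.uniform(0, 100, size=(10, 3))

# A plain tree stands in for the GridSearchCV step of the question.
pipe = make_pipeline(MinMaxScaler(), DecisionTreeRegressor(random_state=0))
pipe.fit(X_train, y_train)

# Prediction through the pipeline: scaling is applied implicitly.
pipeline_pred = pipe.predict(X_test)

# The same thing done by hand with the fitted steps.
scaler = pipe.named_steps["minmaxscaler"]
tree = pipe.named_steps["decisiontreeregressor"]
manual_pred = tree.predict(scaler.transform(X_test))

# The scaler holds the training minimum, not the test minimum.
fitted_on_train = np.allclose(scaler.data_min_, X_train.min(axis=0))
same = np.allclose(pipeline_pred, manual_pred)
print(fitted_on_train, same)
```

If both checks come out true, the pipeline is applying the scaling learned from the training data when predicting, without you having to specify it.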

Upvotes: 3
