Chris

Reputation: 13680

Combine two fitted estimators into a pipeline

I have data at two stages:

import numpy as np

data_pre = np.array([[1., 2., 203.],
                     [0.5, np.nan, 208.]])

data_post = np.array([[2., 2., 203.],
                      [0.5, 2., 208.]])

I also have two pre-existing fitted estimators:

from sklearn.preprocessing import Imputer
from sklearn.ensemble import GradientBoostingRegressor

imp = Imputer(missing_values=np.nan, strategy='mean', axis=1).fit(data_pre)
gbm = GradientBoostingRegressor().fit(data_post[:,:2], data_post[:,2])

I need to pass a fitted pipeline and data_pre to another function.

def the_function_i_need(estimators):
    """Return a fitted pipeline built from the already-fitted estimators."""
    ...  # this is the part I don't know how to write

fitted_pipeline = the_function_i_need([imp, gbm])
sweet_output = static_function(fitted_pipeline, data_pre) 

Is there a way to combine these two existing and fitted model objects into a fitted pipeline without refitting the models or am I out of luck?

Upvotes: 4

Views: 1914

Answers (4)

Thomas Bury

Reputation: 188

You can use the following code to create a pipeline in scikit-learn, fit it with training data, add a fitted estimator to the pipeline, and evaluate its performance on test data:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import RidgeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the initial pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RidgeClassifier())
])

# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

# Create a new LogisticRegression estimator
estimator = LogisticRegression(solver='liblinear')

# Fit the estimator on the training data
estimator.fit(X_train, y_train)

# Add the fitted estimator to the pipeline
pipeline.set_params(model=estimator)

# Evaluate the pipeline with the added estimator on the test data
y_proba = pipeline.predict_proba(X_test)  # class probabilities, not labels
accuracy = pipeline.score(X_test, y_test)
print("Accuracy:", accuracy)
Accuracy: 0.9666666666666667

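The pipeline is first fitted on the training data. A new LogisticRegression instance is then created, fitted separately on the same training data, and swapped in as the final step with pipeline.set_params(); the pipeline itself is not refitted. One caveat: the pipeline's scaler still transforms the input before it reaches the replaced step, so the new estimator should be fitted on data preprocessed the same way (in the code above it is fitted on the unscaled X_train).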

Upvotes: 0

njp

Reputation: 698

To elaborate on Abhinav Arora's answer:

Calling predict on a pipeline calls transform on every intermediate step and predict on the final step. So if an intermediate step is a predictive model (an estimator that has predict() but no transform()), we need to alias its predict() method as transform().

An example:

# Set up some data
import numpy as np

X = np.random.random((10,2)) # <- input into first model
y = np.random.random(10)     # <- This data represents output from first model,
                             #    and input into next model
z = np.random.random(10)     # <- output from second model, final predictor

Now, let's set up two models. The first operates on X to predict y, and the second operates on y to predict z.

from sklearn.linear_model import LinearRegression
mod1 = LinearRegression().fit(X,y.reshape(-1,1))
mod2 = LinearRegression().fit(y.reshape(-1,1),z.reshape(-1,1))

[The reshape calls here just convert the vectors into column arrays.]

Now, each model can be used to predict with mod1.predict(X) and mod2.predict(y.reshape(-1,1)) respectively. However, if we combine these in a pipeline:

from sklearn.pipeline import Pipeline
p=Pipeline([('mod1', mod1),
            ('mod2', mod2)])

p.predict(X) fails with the error:

AttributeError: 'LinearRegression' object has no attribute 'transform'

which refers to the methods (attributes) of mod1. So we need to write a wrapper around mod1, as per Abhinav Arora's answer.

from sklearn.base import BaseEstimator

class Wrapper(BaseEstimator):
    def __init__(self, 
                 intermediate_model):                # Pass through the estimator here
        self.intermediate_model = intermediate_model
    def fit(self, X, y=None):                        # Assume model has already been fit
        return self                                  # so do nothing here 
    def transform(self, X):
        return self.intermediate_model.predict(X)    # alias predict as transform

wrapped_mod1 = Wrapper(mod1)

Then wrapped_mod1.transform(X) produces the same output as mod1.predict(X).

Now

p2=Pipeline([('mod1', wrapped_mod1),
             ('mod2', mod2)])
p2.predict(X)

works as expected.

NB: This method works when calling predict on the pipeline, but fails in cross_validate, GridSearchCV, etc. Those functions clone the wrapped estimator, which resets it to an unfitted state. A workaround to get a pipeline working in cross_validate, etc. is to "freeze" the wrapped estimator by serializing (pickling) and reloading it.
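The cloning behaviour described above can be seen directly with sklearn.base.clone. This is only a small sketch: the Wrapper class is repeated so the snippet runs on its own, and the random data and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.base import BaseEstimator, clone
from sklearn.linear_model import LinearRegression


class Wrapper(BaseEstimator):
    """Same wrapper as above: exposes predict() as transform()."""

    def __init__(self, intermediate_model):
        self.intermediate_model = intermediate_model

    def fit(self, X, y=None):
        return self  # assume the wrapped model is already fitted

    def transform(self, X):
        return self.intermediate_model.predict(X)


# Illustrative random data (not from the question)
rng = np.random.default_rng(0)
X = rng.random((10, 2))
y = rng.random(10)

wrapped = Wrapper(LinearRegression().fit(X, y))

# clone() rebuilds the wrapper from its constructor parameters, and
# parameters that are themselves estimators are cloned as well, so the
# inner LinearRegression comes back without its fitted attributes.
cloned = clone(wrapped)

original_is_fitted = hasattr(wrapped.intermediate_model, "coef_")
clone_is_fitted = hasattr(cloned.intermediate_model, "coef_")
print(original_is_fitted, clone_is_fitted)
```

Since the wrapper's fit() does nothing, the cloned copy never gets its inner model fitted again, which is why predict then fails inside cross_validate and friends.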

Upvotes: 1

Cristobal

Reputation: 357

...A few years later: use make_pipeline() to chain already-fitted scikit-learn estimators:

from sklearn.pipeline import make_pipeline

new_model = make_pipeline(fitted_preprocessor,
                          fitted_model)
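For a fuller picture, here is a minimal runnable sketch of that idea. Several things in it are assumptions rather than part of the original question: it uses SimpleImputer from sklearn.impute (the replacement for the old Imputer class, which has no axis argument and imputes column-wise), the toy arrays are made up, and both estimators are fitted on the same two feature columns so that their shapes line up.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Illustrative toy data: two features with one missing value, plus a target
X_pre = np.array([[1.0, 2.0],
                  [0.5, np.nan],
                  [2.0, 3.0],
                  [1.5, 1.0]])
y = np.array([203.0, 208.0, 205.0, 201.0])

# Fit each estimator separately, outside any pipeline
fitted_preprocessor = SimpleImputer(strategy="mean").fit(X_pre)
fitted_model = GradientBoostingRegressor(random_state=0).fit(
    fitted_preprocessor.transform(X_pre), y
)

# Chain the fitted objects; constructing the pipeline does not refit them
new_model = make_pipeline(fitted_preprocessor, fitted_model)

# predict() runs transform on the imputer, then predict on the regressor
pipeline_pred = new_model.predict(X_pre)
manual_pred = fitted_model.predict(fitted_preprocessor.transform(X_pre))
print(np.allclose(pipeline_pred, manual_pred))
```

Keep in mind that calling new_model.fit(...) afterwards would refit both steps, so this only helps as long as the pipeline is used for prediction.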

Upvotes: 2

Abhinav Arora

Reputation: 3389

I looked into this and couldn't find any straightforward way to do it.

The only approach I can think of is to write a custom transformer that wraps the existing Imputer and GradientBoostingRegressor. You initialize the wrapper with your already fitted regressor and/or imputer, override fit so that it does nothing, and have every subsequent transform call delegate to the transform of the underlying fitted model. This is a dirty workaround and should only be used if it is really important to your application. A good tutorial on writing custom classes for Scikit-Learn Pipelines can be found here. Another working example of custom pipeline objects from scikit-learn's documentation can be found here.
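A rough sketch of such a wrapper for the transformer case might look like the following. The class name PrefittedTransformer and the toy arrays are hypothetical, and SimpleImputer stands in for the older Imputer class, so treat this as an illustration of the idea rather than a definitive implementation.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer


class PrefittedTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical wrapper: fit() is a no-op, transform() delegates."""

    def __init__(self, fitted_transformer):
        self.fitted_transformer = fitted_transformer

    def fit(self, X, y=None):
        # Deliberately do nothing: the wrapped transformer is already fitted
        return self

    def transform(self, X):
        return self.fitted_transformer.transform(X)


# Fit the imputer once; the column means it learns are 2.0 and 3.0
train = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
imputer = SimpleImputer(strategy="mean").fit(train)

frozen = PrefittedTransformer(imputer)

# Calling fit on different data leaves the learned means untouched
other = np.array([[np.nan, 10.0],
                  [100.0, np.nan]])
result = frozen.fit(other).transform(other)
print(result)
```

The missing values are filled with the means learned from train, not from other, which shows that the wrapper's fit left the underlying imputer alone.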

Upvotes: 7
