Reputation: 13680
I have data at two stages:
import numpy as np
data_pre = np.array([[1., 2., 203.],
                     [0.5, np.nan, 208.]])
data_post = np.array([[2., 2., 203.],
                      [0.5, 2., 208.]])
I also have two pre-existing fitted estimators:
from sklearn.impute import SimpleImputer  # Imputer was removed from sklearn.preprocessing
from sklearn.ensemble import GradientBoostingRegressor
imp = SimpleImputer(missing_values=np.nan, strategy='mean').fit(data_pre)
gbm = GradientBoostingRegressor().fit(data_post[:, :2], data_post[:, 2])
I need to pass a fitted pipeline and data_pre to another function.
def the_function_i_need(estimators):
    """Combine the pre-fitted estimators into a single fitted pipeline."""
    ...
    return fitted_pipeline
fitted_pipeline = the_function_i_need([imp, gbm])
sweet_output = static_function(fitted_pipeline, data_pre)
Is there a way to combine these two existing, already-fitted model objects into a fitted pipeline without refitting the models, or am I out of luck?
Upvotes: 4
Views: 1914
Reputation: 188
You can create a pipeline in scikit-learn, fit it on training data, swap a separately fitted estimator into one of its steps with set_params(), and evaluate the result on test data:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import RidgeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create the initial pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RidgeClassifier())
])
# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)
# Create a new LogisticRegression estimator
estimator = LogisticRegression(solver='liblinear')
# Fit the estimator on the *scaled* training data, since the pipeline's
# scaler will transform the inputs before they reach this step at predict time
estimator.fit(pipeline.named_steps['scaler'].transform(X_train), y_train)
# Swap the fitted estimator into the pipeline without refitting the scaler
pipeline.set_params(model=estimator)
# Evaluate the pipeline with the swapped-in estimator on the test data
y_proba = pipeline.predict_proba(X_test)  # available now: LogisticRegression supports it
accuracy = pipeline.score(X_test, y_test)
print("Accuracy:", accuracy)
Accuracy: 0.9666666666666667
The pipeline is first fitted on the training data. A new LogisticRegression estimator is then fitted on the scaled training data and swapped into the pipeline with pipeline.set_params(), replacing the model step without refitting the scaler.
Upvotes: 0
Reputation: 698
To elaborate on Abhinav Arora's answer:
Calling predict on a pipeline calls predict on the final step and transform on all intermediate steps. Thus, if we include a predictive model (an estimator rather than a transformer) as an intermediate step in a pipeline, we need to alias the intermediate estimator's predict() method as transform().
An example:
# Set up some data
import numpy as np
X = np.random.random((10, 2))  # <- input to the first model
y = np.random.random(10)       # <- output of the first model, input to the second
z = np.random.random(10)       # <- output of the second model, the final prediction
Now, let's set up two models. The first operates on X to predict y, and the second operates on y to predict z.
from sklearn.linear_model import LinearRegression
mod1 = LinearRegression().fit(X, y.reshape(-1, 1))
mod2 = LinearRegression().fit(y.reshape(-1, 1), z.reshape(-1, 1))
[The reshape calls here just convert the vectors into column arrays.]
Now, each model can be used to predict with mod1.predict(X) and mod2.predict(y.reshape(-1, 1)) respectively. However, if we combine these in a pipeline:
from sklearn.pipeline import Pipeline
p = Pipeline([('mod1', mod1),
              ('mod2', mod2)])
p.predict(X)
fails with the error:
AttributeError: 'LinearRegression' object has no attribute 'transform'
which refers to the missing method on mod1. So we need to write a wrapper around mod1, as per Abhinav Arora's answer.
from sklearn.base import BaseEstimator

class Wrapper(BaseEstimator):
    def __init__(self, intermediate_model):  # pass the fitted estimator in here
        self.intermediate_model = intermediate_model

    def fit(self, X, y=None):  # assume the model has already been fitted,
        return self            # so do nothing here

    def transform(self, X):
        return self.intermediate_model.predict(X)  # alias predict as transform

wrapped_mod1 = Wrapper(mod1)
Then wrapped_mod1.transform(X) produces the same output as mod1.predict(X).
Now
p2 = Pipeline([('mod1', wrapped_mod1),
               ('mod2', mod2)])
p2.predict(X)
works as expected.
NB: This method works when calling predict on the pipeline, but fails in cross_validate, GridSearchCV, etc. In those functions the wrapped estimator is cloned and reset to an unfitted state. A workaround to get the pipeline working there is to "freeze" the wrapped estimator by serializing (pickling) it and reloading it.
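A minimal sketch of that freezing trick (the FrozenWrapper name and the bytes-based storage are my own illustration, not a scikit-learn API): storing the fitted model as a plain byte string means clone() deep-copies the bytes verbatim instead of re-initialising the estimator.
import pickle
from sklearn.base import BaseEstimator

class FrozenWrapper(BaseEstimator):
    """Holds a fitted model as pickled bytes so clone() cannot reset it."""
    def __init__(self, model_bytes):
        self.model_bytes = model_bytes  # plain bytes survive cloning unchanged

    def fit(self, X, y=None):
        return self  # nothing to fit: the frozen model is already trained

    def transform(self, X):
        model = pickle.loads(self.model_bytes)  # reload the fitted model
        return model.predict(X)                 # alias predict as transform

frozen_mod1 = FrozenWrapper(pickle.dumps(mod1))
p3 = Pipeline([('mod1', frozen_mod1), ('mod2', mod2)])
Unpickling on every transform call is wasteful but keeps the sketch short; in cross_validate, only the final step (mod2) is then refitted on each fold.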
Upvotes: 1
Reputation: 357
...A few years later.
Use make_pipeline() to concatenate already-fitted scikit-learn estimators:
from sklearn.pipeline import make_pipeline
new_model = make_pipeline(fitted_preprocessor,
                          fitted_model)
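For instance, a self-contained sketch (the StandardScaler and LinearRegression steps below are illustrative stand-ins for any pre-fitted estimators):
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

X = np.random.random((10, 3))
y = np.random.random(10)

# Fit the two steps independently, outside any pipeline
fitted_preprocessor = StandardScaler().fit(X)
fitted_model = LinearRegression().fit(fitted_preprocessor.transform(X), y)

# make_pipeline keeps the fitted state, so no call to fit() is needed
new_model = make_pipeline(fitted_preprocessor, fitted_model)
print(new_model.predict(X))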
Upvotes: 2
Reputation: 3389
I tried looking into this and couldn't find any straightforward way to do it.
The only way I can see is to write a custom transformer that serves as a wrapper over the existing Imputer and GradientBoostingRegressor. You can initialize the wrapper with your already-fitted regressor and/or imputer, override fit to do nothing, and have all subsequent transform calls delegate to the transform of the underlying fitted model. This is a hacky way of doing it and should not be used unless it is very important to your application. A good tutorial on writing custom classes for scikit-learn pipelines can be found here. Another working example of custom pipeline objects from scikit-learn's documentation can be found here.
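A minimal sketch of that wrapper idea (the PrefitWrapper name and the toy data are illustrative, not from the question):
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingRegressor

X = np.array([[1., 2.], [0.5, np.nan], [3., 1.]])
y = np.array([203., 208., 210.])

imp = SimpleImputer(strategy='mean').fit(X)                 # pre-fitted transformer
gbm = GradientBoostingRegressor().fit(imp.transform(X), y)  # pre-fitted model

class PrefitWrapper(BaseEstimator, TransformerMixin):
    """Wraps an already-fitted transformer; fit() is a no-op."""
    def __init__(self, model):
        self.model = model

    def fit(self, X, y=None):
        return self                     # do nothing: model is already fitted

    def transform(self, X):
        return self.model.transform(X)  # delegate to the fitted model

fitted_pipeline = Pipeline([('imp', PrefitWrapper(imp)), ('gbm', gbm)])
print(fitted_pipeline.predict(X))       # works without refitting either step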
Upvotes: 7