Fredz0r

Reputation: 621

Implement custom .fit() method for model in sklearn pipeline

I am comparing a number of pipelines in cross-validation. As a benchmark, I want to include a simple model that always uses the same fixed coefficient and hence doesn't depend on the training data. To get this model, I decided to inherit all of the behaviour of sklearn's linear model and implement my own .fit() method, which doesn't look at the training data at all but always uses a stored model.

Used as a standalone model, my custom implementation works fine. As part of a pipeline, however, I get a NotFittedError.

Creating my simple benchmark model and storing it:

import numpy as np
import pickle
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

X = np.array([[1],[2],[3]])
y = [10,20,30]

model = LinearRegression(fit_intercept=False).fit(X,y)
pickle.dump(model, open('benchmark_model.txt', 'wb'))
print (model.coef_)

[10.]

Defining my own benchmark_model(), which implements a custom fit method. The fit method opens the stored model:

class benchmark_model(LinearRegression):
      def fit(self, X, y = None):
            self = pickle.load(open('benchmark_model.txt', 'rb')) 
            return self

Testing the custom fit implementation as model on different data seems to go well.

X=np.array([[1],[2],[3]])
y=[5,10,15]

model = benchmark_model()
model = model.fit(X,y)

print (model.coef_)
print (model.predict(X))

[10.]
[10. 20. 30.]

Now, I am first using a normal LinearRegression as part of a pipeline, which seems to go as expected:

pipe = Pipeline([('model',LinearRegression())])
pipe.fit(X,y).predict(X)

array([ 5., 10., 15.])

However, when I use my custom benchmark model as part of the pipeline it doesn't work anymore.

pipe = Pipeline([('model',benchmark_model())])
pipe.fit(X,y).predict(X)

NotFittedError: This benchmark_model instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.

Upvotes: 1

Views: 962

Answers (1)

rvf

Reputation: 1449

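The problem is that the assignment self = pickle.load(...) only rebinds the local name self inside fit(); the benchmark_model instance itself is never modified. fit() then returns a separate LinearRegression object. Calling model = model.fit(X, y) directly hides this, because you keep the returned object. The pipeline, however, ignores the return value and keeps its original benchmark_model instance, which still has no coef_, hence the NotFittedError. It works if we instead copy the learned parameters from the fixed model onto self: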

class benchmark_model(LinearRegression):
    def fit(self, X, y=None):
        # Ignore X and y; load the stored model and copy its parameters onto self
        with open('benchmark_model.txt', 'rb') as f:
            fixed_model = pickle.load(f)
        self.coef_ = fixed_model.coef_
        self.intercept_ = fixed_model.intercept_
        return self

Now fit actually returns an instance of benchmark_model.
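
To check this end to end, here is a minimal, self-contained sketch that stores a reference model, defines the parameter-copying class, and runs it inside a Pipeline. MODEL_PATH and the temporary directory are assumptions made so the example does not write into the working directory; the question itself uses 'benchmark_model.txt'.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Hypothetical location for the stored model (the question uses 'benchmark_model.txt').
MODEL_PATH = os.path.join(tempfile.mkdtemp(), 'benchmark_model.pkl')

# Fit and store a reference model with coefficient 10 and no intercept.
reference = LinearRegression(fit_intercept=False).fit(
    np.array([[1], [2], [3]]), [10, 20, 30]
)
with open(MODEL_PATH, 'wb') as f:
    pickle.dump(reference, f)


class benchmark_model(LinearRegression):
    def fit(self, X, y=None):
        # Ignore the training data and copy the stored parameters onto self,
        # so the instance held by the pipeline becomes fitted.
        with open(MODEL_PATH, 'rb') as f:
            fixed_model = pickle.load(f)
        self.coef_ = fixed_model.coef_
        self.intercept_ = fixed_model.intercept_
        return self


# Different training data than the reference model was fitted on.
X = np.array([[1], [2], [3]])
y = [5, 10, 15]

pipe = Pipeline([('model', benchmark_model())])
predictions = pipe.fit(X, y).predict(X)
fitted_step = pipe.named_steps['model']
print(predictions)
```

The predictions should follow the stored coefficient rather than the new y, and the step inside the pipeline should still be a benchmark_model instance rather than a plain LinearRegression.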

Upvotes: 1
