Alexander

Reputation: 71

Is it possible to fit separate parts of an sklearn pipeline?

Consider the following sklearn Pipeline:

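from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    TfidfVectorizer(),
    LinearRegression()
)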

I have a pretrained TfidfVectorizer, so when I call pipeline.fit(X, y) I want only LinearRegression to be fitted, without refitting TfidfVectorizer.

I could apply the transformation in advance and fit LinearRegression on the transformed data, but my project has many transformers in a pipeline, some pretrained and some not, so I am looking for a way to avoid writing another wrapper around sklearn estimators and to stay within the bounds of a single Pipeline object.

To my mind, estimators should have a parameter that prevents refitting when .fit() is called on an object that is already fitted.

Upvotes: 7

Views: 1755

Answers (3)

Ivan Reshetnikov

Reputation: 410

You can use this hack to fit a transformer only once:

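from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer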

def fit_once(transformer):
    fitted = [False]

    def func(x):
        if not fitted[0]:
            transformer.fit(x)
            fitted[0] = True
        return transformer.transform(x)

    return FunctionTransformer(func)

pipeline = make_pipeline(
    fit_once(TfidfVectorizer()),
    LinearRegression()
)
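
A variant of the same idea for a transformer that is already fitted before it enters the pipeline: a minimal sketch, assuming a small wrapper class is acceptable. The name `Prefitted` and the toy data are hypothetical, not part of scikit-learn. Its fit() does nothing, so pipeline.fit() only trains the later steps.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline


class Prefitted(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper around an already fitted transformer."""

    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X, y=None):
        # Deliberately a no-op: the wrapped transformer keeps its learned state.
        return self

    def transform(self, X):
        return self.transformer.transform(X)


# Toy data, made up for illustration.
pretraining_docs = ["red apple", "green apple", "red car", "green car"]
train_docs = ["red apple", "green apple"]
train_targets = [1.0, 2.0]

# "Pretrain" the vectorizer outside the pipeline.
vectorizer = TfidfVectorizer().fit(pretraining_docs)

pipeline = make_pipeline(Prefitted(vectorizer), LinearRegression())
pipeline.fit(train_docs, train_targets)  # only LinearRegression is fitted here

predictions = pipeline.predict(["red car"])
```

One caveat: anything that clones the pipeline (for example GridSearchCV, or a pipeline created with the memory parameter) also clones the wrapped transformer, and the clone is unfitted, so this sketch only holds when the pipeline object is fitted directly.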

Upvotes: 0

Ivan Reshetnikov

Reputation: 410

Look at the "memory" parameter. It caches the fitted transformers of a pipeline.

https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html

pipeline = make_pipeline(
    TfidfVectorizer(),
    LinearRegression(),
    memory='cache_directory'
)
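
A slightly fuller sketch of how this could be used, assuming a temporary directory as the cache location and toy data made up for illustration. As far as I understand it, the cache is keyed on the transformer's parameters and the input data, so it saves work when the same fit is repeated, but it does not reuse a transformer instance that was fitted elsewhere.

```python
import os
import shutil
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Toy data, made up for illustration.
docs = ["red apple", "green apple", "red car", "green car"]
targets = [1.0, 2.0, 3.0, 4.0]

cache_dir = tempfile.mkdtemp()  # temporary cache location for this sketch
try:
    pipeline = make_pipeline(
        TfidfVectorizer(),
        LinearRegression(),
        memory=cache_dir,
    )
    pipeline.fit(docs, targets)  # fits the vectorizer and stores the result on disk
    pipeline.fit(docs, targets)  # same data and parameters: the cached result is reused
    cache_entries = os.listdir(cache_dir)
    predictions = pipeline.predict(docs)
finally:
    shutil.rmtree(cache_dir)  # clean up the temporary cache
```

The final estimator (LinearRegression here) is not cached, so it is refitted on every call to fit().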

Upvotes: 3

Rafa

Reputation: 684

You can access only the regressor by defining your pipeline with named steps as follows:

from sklearn.pipeline import Pipeline

pipeline = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer()),
    ('regressor', LinearRegression())
])

and then

pipeline['regressor']

should give you only the regressor.
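
Building on step access by name, here is a minimal sketch of fitting only the regressor while the vectorizer stays as it was pretrained. The toy data is made up for illustration, and it assumes the vectorizer was fitted before being placed in the pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Toy data, made up for illustration.
pretraining_docs = ["red apple", "green apple", "red car", "green car"]
train_docs = ["red apple", "green apple"]
train_targets = [1.0, 2.0]

# Fit the vectorizer ahead of time, outside the pipeline.
vectorizer = TfidfVectorizer().fit(pretraining_docs)

pipeline = Pipeline(steps=[
    ("vectorizer", vectorizer),
    ("regressor", LinearRegression()),
])

# Transform with the pretrained step, then fit only the final step.
features = pipeline["vectorizer"].transform(train_docs)
pipeline["regressor"].fit(features, train_targets)

# The pipeline can then be used as a whole for prediction.
predictions = pipeline.predict(["red car"])
```

This avoids calling pipeline.fit() at all, which would refit every step, including the vectorizer.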

Upvotes: 0
