Reputation: 71
Consider the following sklearn Pipeline:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    TfidfVectorizer(),
    LinearRegression()
)
My TfidfVectorizer is already pretrained, so when I call pipeline.fit(X, y) I want only LinearRegression to be fitted; I don't want to refit TfidfVectorizer.
I could apply the transformation in advance and fit LinearRegression on the transformed data, but in my project the pipeline has many transformers, some pretrained and some not, so I am looking for a way to stay within a single Pipeline object without writing another wrapper around the sklearn estimators.
To my mind, estimators should have a parameter that tells .fit() to skip refitting when the object is already fitted.
Upvotes: 7
Views: 1755
Reputation: 410
You can use this hack to fit the transformer only once:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

def fit_once(transformer):
    fitted = [False]
    def func(x):
        # Fit only on the first call; afterwards just transform.
        if not fitted[0]:
            transformer.fit(x)
            fitted[0] = True
        return transformer.transform(x)
    return FunctionTransformer(func)

pipeline = make_pipeline(
    fit_once(TfidfVectorizer()),
    LinearRegression()
)
Upvotes: 0
Reputation: 410
Look at the memory parameter: it caches the fitted transformers of a pipeline, so a transformer is not refit when it is called again with the same parameters and data.
https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html
pipeline = make_pipeline(
    TfidfVectorizer(),
    LinearRegression(),
    memory='cache_directory'
)
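A runnable sketch of the caching behavior (the toy corpus and the temporary cache directory are illustrative): the second fit with identical data and parameters loads the vectorizer from the cache instead of refitting it. Note this does not let you inject an externally pretrained transformer; it only avoids redundant refits.

```python
import tempfile
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

with tempfile.TemporaryDirectory() as cache_dir:
    pipeline = make_pipeline(
        TfidfVectorizer(),
        LinearRegression(),
        memory=cache_dir,  # fitted transformers are cached on disk here
    )
    X = ["red apple", "green apple", "blue banana", "yellow banana"]
    y = [1.0, 1.0, 0.0, 0.0]
    pipeline.fit(X, y)  # fits the vectorizer and writes it to the cache
    pipeline.fit(X, y)  # same data and params: vectorizer comes from cache
```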
Upvotes: 3
Reputation: 684
You can access only the regressor by defining your pipeline with named steps (note this requires the Pipeline constructor; make_pipeline does not take a steps argument):

from sklearn.pipeline import Pipeline

pipeline = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer()),
    ('regressor', LinearRegression())
])
and then
pipeline['regressor']
should give you only the regressor.
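Building on that, one way to use the named steps for the original problem (toy data here is illustrative): fit the vectorizer separately, then fit only the regressor on the transformed data; the full pipeline is still usable end to end for prediction.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

pipeline = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer()),
    ('regressor', LinearRegression()),
])

X = ["red apple", "green apple", "blue banana"]
y = [1.0, 1.0, 0.0]

# Fit (or load) the vectorizer on its own, then fit ONLY the regressor
# on the transformed data; pipeline.fit is never called, so the
# vectorizer is not refit.
X_tfidf = pipeline['vectorizer'].fit_transform(X)
pipeline['regressor'].fit(X_tfidf, y)

# Both steps are individually fitted, so predict works end to end.
pred = pipeline.predict(["red apple"])
```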
Upvotes: 0