Reputation: 3318
I got this from the sklearn webpage:
Pipeline: Pipeline of transforms with a final estimator
Make_pipeline: Construct a Pipeline from the given estimators. This is a shorthand for the Pipeline constructor.
But I still do not understand when I should use each one. Can anyone give me an example?
Upvotes: 90
Views: 42952
Reputation: 1257
In scikit-learn, both Pipeline and make_pipeline are used to create a sequence of transformations and estimators that can be treated as a single unit.
Pipeline: This requires you to explicitly name each step in the sequence.
make_pipeline: This automatically assigns names to each step based on the class names of the estimators.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Define the pipeline with explicitly named steps
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # explicitly name the step 'scaler'
    ('pca', PCA(n_components=2))   # explicitly name the step 'pca'
])
Here we explicitly name each step: 'scaler' for StandardScaler and 'pca' for PCA. This is useful for referencing specific steps later, especially when you need to access or modify them.
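For example, here is a minimal sketch (not part of the quoted docs; the variable names are just illustrative) of reaching into the pipeline through the step names:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pipeline = Pipeline([('scaler', StandardScaler()), ('pca', PCA(n_components=2))])

print(pipeline.named_steps['pca'])        # access a step by the name you gave it
pipeline.set_params(pca__n_components=3)  # modify a step's parameter by name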
from sklearn.pipeline import make_pipeline

# Define the same pipeline using make_pipeline
pipeline = make_pipeline(
    StandardScaler(),    # no need to name the step
    PCA(n_components=2)  # no need to name the step
)
In this example, make_pipeline automatically names the steps based on their class names, so you don't need to specify names manually. The steps will be named 'standardscaler' and 'pca' respectively.
Both approaches will yield the same result in terms of the transformations applied to the data.
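As a quick check (a sketch, not from the original answer), you can list the auto-generated names yourself:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
print(list(pipeline.named_steps))  # ['standardscaler', 'pca']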
Upvotes: 0
Reputation: 23459
If we look at the source code, make_pipeline() creates a Pipeline object, so the two are equivalent. As mentioned by @Mikhail Korobov, the only difference is that make_pipeline() doesn't accept estimator names; they are set to the lowercase of their types. In other words, type(estimator).__name__.lower() is used to create the estimator names (source). So it's really just a simpler way of building a pipeline.
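As a small illustration (my own sketch), you can reproduce that naming rule directly:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# make_pipeline derives the step name from the lowercased class name
print(type(LogisticRegression()).__name__.lower())      # 'logisticregression'
print(make_pipeline(LogisticRegression()).steps[0][0])  # 'logisticregression'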
On a related note, you can use get_params() to look up parameter names. This is useful if you want to know the parameter names for GridSearchCV. The parameter names are created by concatenating the estimator names with their kwargs recursively, e.g. max_iter of a LogisticRegression() is stored as 'logisticregression__max_iter', and the C parameter of OneVsRestClassifier(LogisticRegression()) as 'onevsrestclassifier__estimator__C'; the latter because, written with kwargs, it is OneVsRestClassifier(estimator=LogisticRegression()).
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification
X, y = make_classification()
pipe = make_pipeline(PCA(), LogisticRegression())
print(pipe.get_params())
# {'memory': None,
# 'steps': [('pca', PCA()), ('logisticregression', LogisticRegression())],
# 'verbose': False,
# 'pca': PCA(),
# 'logisticregression': LogisticRegression(),
# 'pca__copy': True,
# 'pca__iterated_power': 'auto',
# 'pca__n_components': None,
# 'pca__n_oversamples': 10,
# 'pca__power_iteration_normalizer': 'auto',
# 'pca__random_state': None,
# 'pca__svd_solver': 'auto',
# 'pca__tol': 0.0,
# 'pca__whiten': False,
# 'logisticregression__C': 1.0,
# 'logisticregression__class_weight': None,
# 'logisticregression__dual': False,
# 'logisticregression__fit_intercept': True,
# 'logisticregression__intercept_scaling': 1,
# 'logisticregression__l1_ratio': None,
# 'logisticregression__max_iter': 100,
# 'logisticregression__multi_class': 'auto',
# 'logisticregression__n_jobs': None,
# 'logisticregression__penalty': 'l2',
# 'logisticregression__random_state': None,
# 'logisticregression__solver': 'lbfgs',
# 'logisticregression__tol': 0.0001,
# 'logisticregression__verbose': 0,
# 'logisticregression__warm_start': False}
# use the params from above to construct param_grid
param_grid = {'pca__n_components': [2, None], 'logisticregression__C': [0.1, 1]}
gs = GridSearchCV(pipe, param_grid)
gs.fit(X, y)
best_score = gs.score(X, y)
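And to see the nested OneVsRestClassifier case mentioned above (a quick sketch; the variable name is just illustrative):
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

ovr_pipe = make_pipeline(OneVsRestClassifier(LogisticRegression()))
# The wrapped estimator's C parameter shows up under the doubly-nested name
print('onevsrestclassifier__estimator__C' in ovr_pipe.get_params())  # True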
Circling back to Pipeline vs make_pipeline: Pipeline gives you more flexibility in naming the steps, but if you name each estimator with the lowercase of its type, then Pipeline and make_pipeline will have the same params and steps attributes.
pca = PCA()
lr = LogisticRegression()
make_pipe = make_pipeline(pca, lr)
pipe = Pipeline([('pca', pca), ('logisticregression', lr)])
make_pipe.get_params() == pipe.get_params() # True
make_pipe.steps == pipe.steps # True
Upvotes: 2
Reputation: 22248
The only difference is that make_pipeline generates names for the steps automatically.
Step names are needed e.g. if you want to use a pipeline with model selection utilities (e.g. GridSearchCV). With grid search you need to specify parameters for the various steps of a pipeline:
pipe = Pipeline([('vec', CountVectorizer()), ('clf', LogisticRegression())])
param_grid = [{'clf__C': [1, 10, 100, 1000]}]
gs = GridSearchCV(pipe, param_grid)
gs.fit(X, y)
Compare it with make_pipeline:
pipe = make_pipeline(CountVectorizer(), LogisticRegression())
param_grid = [{'logisticregression__C': [1, 10, 100, 1000]}]
gs = GridSearchCV(pipe, param_grid)
gs.fit(X, y)
So, with Pipeline the step names are explicit and don't change if you swap the estimator in a step, so the grid key stays e.g. clf__C; with make_pipeline the names are auto-generated from the class names, so the key becomes e.g. logisticregression__C.
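A small sketch of that last point (not from the original answer): with an explicit step name, the grid key stays put even if you swap the estimator in that step.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Same step name 'clf', different estimator; the grid key is still 'clf__C'
pipe = Pipeline([('vec', CountVectorizer()), ('clf', LinearSVC())])
param_grid = [{'clf__C': [1, 10, 100, 1000]}]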
When to use them is up to you :) I prefer make_pipeline for quick experiments and Pipeline for more stable code; a rule of thumb: IPython Notebook -> make_pipeline; Python module in a larger project -> Pipeline. But it is certainly not a big deal to use make_pipeline in a module or Pipeline in a short script or a notebook.
Upvotes: 143