Reputation: 3318
I got this from the sklearn webpage:
Pipeline: Pipeline of transforms with a final estimator
Make_pipeline: Construct a Pipeline from the given estimators. This is a shorthand for the Pipeline constructor.
But I still do not understand when I should use each one. Can anyone give me an example?
Upvotes: 90
Views: 42952
Reputation: 1257
In scikit-learn, both Pipeline and make_pipeline are used to create a sequence of transformations and estimators that can be treated as a single unit.
Pipeline: This requires you to explicitly name each step in the sequence.
make_pipeline: This automatically assigns names to each step based on the class names of the estimators.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Define the pipeline with explicitly named steps
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # explicitly name the step 'scaler'
    ('pca', PCA(n_components=2))   # explicitly name the step 'pca'
])
Here we explicitly name each step: 'scaler' for StandardScaler and 'pca' for PCA. This is useful for referencing specific steps later, especially when you need to access or modify them.
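For example, here is a minimal sketch (not part of the quoted docs; the variable names are just illustrative) of reaching into the pipeline through the step names:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pipeline = Pipeline([('scaler', StandardScaler()), ('pca', PCA(n_components=2))])

print(pipeline.named_steps['pca'])        # access a step by the name you gave it
pipeline.set_params(pca__n_components=3)  # modify a step's parameter by name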
from sklearn.pipeline import make_pipeline

# Define the same pipeline using make_pipeline
pipeline = make_pipeline(
    StandardScaler(),    # no need to name the step
    PCA(n_components=2)  # no need to name the step
)
In this example, make_pipeline automatically names the steps based on their class names, so you don't need to specify names manually. The steps will be named 'standardscaler' and 'pca' respectively.
Both approaches will yield the same result in terms of the transformations applied to the data.
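As a quick check (a sketch, not from the original answer), you can list the auto-generated names yourself:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
print(list(pipeline.named_steps))  # ['standardscaler', 'pca']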
Upvotes: 0
Reputation: 23459
If we look at the source code, make_pipeline() creates a Pipeline object, so the two are equivalent. As mentioned by @Mikhail Korobov, the only difference is that make_pipeline() doesn't accept estimator names; they are set to the lowercase of their types. In other words, type(estimator).__name__.lower() is used to create the estimator names (source). So it's really just a simpler way of building a pipeline.
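As a small illustration (my own sketch), you can reproduce that naming rule directly:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# make_pipeline derives the step name from the lowercased class name
print(type(LogisticRegression()).__name__.lower())      # 'logisticregression'
print(make_pipeline(LogisticRegression()).steps[0][0])  # 'logisticregression'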
On a related note, you can use get_params() to look up parameter names. This is useful if you want to know the parameter names for GridSearchCV. The parameter names are created by concatenating the estimator names with their kwargs recursively, e.g. max_iter of a LogisticRegression() is stored as 'logisticregression__max_iter', and the C parameter of OneVsRestClassifier(LogisticRegression()) as 'onevsrestclassifier__estimator__C'; the latter because, written with kwargs, it is OneVsRestClassifier(estimator=LogisticRegression()).
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification
X, y = make_classification()
pipe = make_pipeline(PCA(), LogisticRegression())
print(pipe.get_params())
# {'memory': None,
# 'steps': [('pca', PCA()), ('logisticregression', LogisticRegression())],
# 'verbose': False,
# 'pca': PCA(),
# 'logisticregression': LogisticRegression(),
# 'pca__copy': True,
# 'pca__iterated_power': 'auto',
# 'pca__n_components': None,
# 'pca__n_oversamples': 10,
# 'pca__power_iteration_normalizer': 'auto',
# 'pca__random_state': None,
# 'pca__svd_solver': 'auto',
# 'pca__tol': 0.0,
# 'pca__whiten': False,
# 'logisticregression__C': 1.0,
# 'logisticregression__class_weight': None,
# 'logisticregression__dual': False,
# 'logisticregression__fit_intercept': True,
# 'logisticregression__intercept_scaling': 1,
# 'logisticregression__l1_ratio': None,
# 'logisticregression__max_iter': 100,
# 'logisticregression__multi_class': 'auto',
# 'logisticregression__n_jobs': None,
# 'logisticregression__penalty': 'l2',
# 'logisticregression__random_state': None,
# 'logisticregression__solver': 'lbfgs',
# 'logisticregression__tol': 0.0001,
# 'logisticregression__verbose': 0,
# 'logisticregression__warm_start': False}
# use the params from above to construct param_grid
param_grid = {'pca__n_components': [2, None], 'logisticregression__C': [0.1, 1]}
gs = GridSearchCV(pipe, param_grid)
gs.fit(X, y)
best_score = gs.score(X, y)
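And to see the nested OneVsRestClassifier case mentioned above (a quick sketch; the variable name is just illustrative):
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

ovr_pipe = make_pipeline(OneVsRestClassifier(LogisticRegression()))
# The wrapped estimator's C parameter shows up under the doubly-nested name
print('onevsrestclassifier__estimator__C' in ovr_pipe.get_params())  # True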
Circling back to Pipeline vs make_pipeline: Pipeline gives you more flexibility in naming the steps, but if you name each estimator with the lowercase of its type, then Pipeline and make_pipeline will have the same params and steps attributes.
pca = PCA()
lr = LogisticRegression()
make_pipe = make_pipeline(pca, lr)
pipe = Pipeline([('pca', pca), ('logisticregression', lr)])
make_pipe.get_params() == pipe.get_params() # True
make_pipe.steps == pipe.steps # True
Upvotes: 2
Reputation: 22248
The only difference is that make_pipeline generates names for the steps automatically.
Step names are needed e.g. if you want to use a pipeline with model selection utilities (e.g. GridSearchCV). With grid search you need to specify parameters for the various steps of a pipeline:
pipe = Pipeline([('vec', CountVectorizer()), ('clf', LogisticRegression())])
param_grid = [{'clf__C': [1, 10, 100, 1000]}]
gs = GridSearchCV(pipe, param_grid)
gs.fit(X, y)
Compare it with make_pipeline:
pipe = make_pipeline(CountVectorizer(), LogisticRegression())
param_grid = [{'logisticregression__C': [1, 10, 100, 1000]}]
gs = GridSearchCV(pipe, param_grid)
gs.fit(X, y)
So, with Pipeline the step names are explicit and don't change if you swap the estimator in a step, so the grid key stays e.g. clf__C; with make_pipeline the names are auto-generated from the class names, so the key becomes e.g. logisticregression__C.
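A small sketch of that last point (not from the original answer): with an explicit step name, the grid key stays put even if you swap the estimator in that step.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Same step name 'clf', different estimator; the grid key is still 'clf__C'
pipe = Pipeline([('vec', CountVectorizer()), ('clf', LinearSVC())])
param_grid = [{'clf__C': [1, 10, 100, 1000]}]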
When to use them is up to you :) I prefer make_pipeline for quick experiments and Pipeline for more stable code; a rule of thumb: IPython Notebook -> make_pipeline; Python module in a larger project -> Pipeline. But it is certainly not a big deal to use make_pipeline in a module or Pipeline in a short script or a notebook.
Upvotes: 143