thibaultbl

Reputation: 984

Get intermediate data state in scikit-learn Pipeline

Given the following example:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.pipeline import Pipeline
import pandas as pd

pipe = Pipeline([
    ("tf_idf", TfidfVectorizer()),
    ("nmf", NMF())
])

data = pd.DataFrame([["Salut comment tu vas", "Hey how are you today", "I am okay and you ?"]]).T
data.columns = ["test"]

pipe.fit_transform(data.test)

I would like to get the intermediate data state in the scikit-learn pipeline, corresponding to the tf_idf output (after fit_transform on tf_idf, but before NMF), or equivalently the NMF input. To put it another way, it would be the same as applying

TfidfVectorizer().fit_transform(data.test)

I know about pipe.named_steps["tf_idf"] to get the intermediate transformer, but with this method I can only get the transformer's parameters, not the data.

Upvotes: 33

Views: 15190

Answers (7)

David Gilbertson

Reputation: 4883

Slicing a pipeline returns a new pipeline with a subset of the steps.

For example, you can get a sub-pipeline (without the last step) using pipe[:-1]

transformed = pipe[:-1].fit_transform(X, y)  # what the final estimator gets

If you've already called fit(), change fit_transform to transform.

If you want a Pandas dataframe with column names (and you're transforming columns too), you can do:

sub_pipe = pipe[:-1]
transformed_df = pd.DataFrame(
    data=sub_pipe.fit_transform(X, y),
    columns=sub_pipe.get_feature_names_out(),
)

If you're calling transform multiple times, you'll probably want to supply the memory argument to Pipeline() to cache the results. See Caching transformers: avoid repeated computation for details.
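For reference, a minimal sketch of that caching setup on the question's pipeline (the memory argument is a real Pipeline parameter; the temporary cache directory here is just an example):

from tempfile import mkdtemp

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.pipeline import Pipeline

# Fitted transformers are cached on disk, so repeated fit/transform
# calls on unchanged data reuse cached results instead of refitting.
cache_dir = mkdtemp()  # example location; any writable path works
pipe = Pipeline(
    [("tf_idf", TfidfVectorizer()), ("nmf", NMF())],
    memory=cache_dir,
)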

Upvotes: 4

cs_stackX

Reputation: 1527

Using slicing: model[:-1].transform(X), where model is the Pipeline object. Note that you need to call model.fit(X_train, y_train) on your pipeline object first.
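As a minimal sketch with the question's pipeline (slicing requires scikit-learn 0.21 or later):

# Fit the full pipeline first, then transform with the fitted
# sub-pipeline that excludes the final NMF step.
pipe.fit(data.test)
tf_idf_output = pipe[:-1].transform(data.test)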

Upvotes: 1

Hans Bouwmeester

Reputation: 1519

Here's what I use:

def fit_transform_step(pipe, X, y=None, step_name=None):
    if step_name not in pipe.named_steps:
        raise ValueError(f"step not in Pipeline: {step_name}")
    Xt = X
    # Fit-transform each step in order, stopping after the requested one.
    for name, transformer in pipe.steps:
        if transformer != 'passthrough':  # skip disabled steps
            Xt = transformer.fit_transform(Xt, y)
        if name == step_name:
            break
    return Xt

Call it like this:

tf_idf_out = fit_transform_step(pipe, data.test, step_name='tf_idf')

Upvotes: 1

Tony B

Reputation: 386

I'm not sure exactly what your use case is, but one simple solution is this (the pipeline must already be fitted, since each step's transform is called):

# get feature values by transforming x at each step, except the final estimator

x_intermediate = data.test

for name, transformer in pipe.steps[:-1]:
    x_intermediate = transformer.transform(x_intermediate)

print(x_intermediate)

Good luck-
Tony

Upvotes: 1

user394430

Reputation: 2967

I've created a gist for this. Essentially, as of Python 3.2, using a context manager, the code below lets you retrieve intermediate results into a dict keyed by the names of the pipeline's transformers.

with intermediate_transforms(pipe):
    Xt = pipe.transform(X)
    intermediate_results = pipe.intermediate_results__

This is accomplished via the function below, but see my gist for more documentation.

import contextlib
from functools import partial

from sklearn.pipeline import Pipeline

@contextlib.contextmanager
def intermediate_transforms(pipe: Pipeline):
    # Our temporary overload of Pipeline._transform() method.
    # https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/pipeline.py
    def _pipe_transform(self, X):
        Xt = X
        for _, name, transform in self._iter():
            Xt = transform.transform(Xt)
            self.intermediate_results__[name] = Xt
        return Xt

    if not isinstance(pipe, Pipeline):
        raise ValueError(f'"{pipe}" must be a Pipeline.')

    pipe.intermediate_results__ = {}
    _transform_before = pipe._transform
    pipe._transform = partial(_pipe_transform, pipe)  # Monkey-patch our _pipe_transform method.
    try:
        yield pipe  # Release our patched object to the context.
    finally:
        # Restore the original method, even if the block raises.
        pipe._transform = _transform_before
        delattr(pipe, 'intermediate_results__')
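Applied to the question's pipeline, usage might look like this (the pipeline must be fitted before transform can be called):

pipe.fit(data.test)  # transform() requires fitted steps

with intermediate_transforms(pipe):
    Xt = pipe.transform(data.test)
    intermediate_results = pipe.intermediate_results__

tf_idf_output = intermediate_results["tf_idf"]  # output of the tf_idf step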

Upvotes: 1

Marcus V.

Reputation: 6869

As @Vivek Kumar suggested in the comment and as I answered here, I find a debug step that prints information or writes intermediate dataframes to CSV useful:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.pipeline import Pipeline
import pandas as pd
from sklearn.base import TransformerMixin, BaseEstimator


class Debug(BaseEstimator, TransformerMixin):

    def transform(self, X):
        print(X.shape)
        self.shape = X.shape
        # what other output you want
        return X

    def fit(self, X, y=None, **fit_params):
        return self

pipe = Pipeline([
    ("tf_idf", TfidfVectorizer()),
    ("debug", Debug()),
    ("nmf", NMF())
])

data = pd.DataFrame([["Salut comment tu vas", "Hey how are you today", "I am okay and you ?"]]).T
data.columns = ["test"]

pipe.fit_transform(data.test)

Edit

I now added a state to the debug transformer. Now you can access the shape as in the answer by @datasailor with:

pipe.named_steps["debug"].shape

Upvotes: 25

CodeZero

Reputation: 1699

As far as I understand, you want to get the transformed training data. The tf_idf step was already fitted when you fit the pipeline, so you can just use that fitted transformer to transform the training data again:

pipe.named_steps["tf_idf"].transform(data.test)

Upvotes: 13
