anu.agg

Reputation: 197

Error while extracting sub-pipeline using index from sklearn Pipeline

I have a machine learning pipeline:

logreg = Pipeline([('vect', CountVectorizer(ngram_range=(1,1))),
                   ('tfidf', TfidfTransformer(sublinear_tf=True, use_idf=True)),
                   ('clf', LogisticRegression(n_jobs=-1, C=1e2, multi_class='ovr', 
                                              solver='lbfgs', max_iter=1000))])

logreg.fit(X_train, y_train)

I want to extract the feature matrix produced by the first two steps of the pipeline, so I tried to slice out a sub-pipeline containing only those two steps. The following code raises an error:

logreg[:-1].fit(X)

TypeError: 'Pipeline' object has no attribute '__getitem__'

How do I extract the first two steps of the Pipeline without building a new pipeline for data transformation?

Upvotes: 1

Views: 100

Answers (2)

Venkatachalam

Reputation: 16966

I think you are using an old version of sklearn. With versions >=0.21.3, indexing the pipeline the way you did should work.

See the scikit-learn release notes for details.

Example:

from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.datasets import fetch_20newsgroups
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

categories = ['alt.atheism', 'talk.religion.misc']
newsgroups_train = fetch_20newsgroups(subset='train',
                                      categories=categories)

X, y = newsgroups_train.data, newsgroups_train.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y)


logreg = Pipeline([('vect', CountVectorizer(ngram_range=(1, 1))),
                   ('tfidf', TfidfTransformer(sublinear_tf=True, use_idf=True)),
                   ('clf', LogisticRegression(n_jobs=-1, C=1e2,
                                              multi_class='ovr',
                                              solver='lbfgs', max_iter=1000))])
logreg.fit(X_train, y_train)

logreg[:-1].fit_transform(X_train)

# <599x15479 sparse matrix of type '<class 'numpy.float64'>'
#   with 107539 stored elements in Compressed Sparse Row format>
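Since the full pipeline is already fitted at this point, calling transform on the slice should be enough to get the feature matrix without refitting. The following is a minimal sketch under that assumption; the toy corpus and the `features` variable are made up for illustration, and `n_jobs` and `multi_class` are left out only to keep the example short.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy corpus and labels, invented for this sketch only.
docs = ["the cat sat on the mat", "dogs chase cats",
        "the dog barked", "cats purr softly"]
labels = [0, 1, 1, 0]

logreg = Pipeline([('vect', CountVectorizer(ngram_range=(1, 1))),
                   ('tfidf', TfidfTransformer(sublinear_tf=True, use_idf=True)),
                   ('clf', LogisticRegression(C=1e2, solver='lbfgs',
                                              max_iter=1000))])
logreg.fit(docs, labels)

# Slicing returns a new Pipeline that wraps the same, already fitted
# transformer objects, so transform() can be used directly.
sub_pipe = logreg[:-1]
features = sub_pipe.transform(docs)

print(features.shape)
```

Because the slice reuses the original step objects rather than copies, calling fit or fit_transform on it also refits the transformers inside the original pipeline.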

Upvotes: 0

Wickkiey

Reputation: 4642

If you want to execute only some of the steps, you can create a new Pipeline at runtime.

partial_pipe = Pipeline(logreg.steps[:-1])
partial_pipe.fit(data)

The steps of the pipeline are available in the steps attribute of the Pipeline object.
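As a rough sketch of this approach on a pipeline that has already been fitted: the steps list holds (name, estimator) tuples, and a Pipeline built from a slice of it wraps the same fitted objects, so transform should work without calling fit again. The toy corpus and variable names below are assumptions for illustration, not part of the original question.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy corpus and labels, invented for this sketch only.
data = ["good movie", "bad movie", "great film", "terrible film"]
target = [1, 0, 1, 0]

logreg = Pipeline([('vect', CountVectorizer()),
                   ('tfidf', TfidfTransformer(sublinear_tf=True)),
                   ('clf', LogisticRegression(max_iter=1000))])
logreg.fit(data, target)

# steps is a plain list of (name, estimator) tuples, so it can be sliced
# even on sklearn versions where Pipeline itself cannot be indexed.
partial_pipe = Pipeline(logreg.steps[:-1])
step_names = [name for name, _ in partial_pipe.steps]

# The wrapped transformers are already fitted, so no fit() call is made here.
feature_matrix = partial_pipe.transform(data)

print(step_names)
print(feature_matrix.shape)
```

Note that partial_pipe.fit(...) would refit the shared vectorizer and TF-IDF transformer in place, which also changes what the original pipeline sees.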

Upvotes: 1
