kanimbla

Reputation: 890

sklearn pipeline with PCA on feature subset using FunctionTransformer

Consider the task of chaining PCA and a regression model in a pipeline, where PCA performs the dimensionality reduction and the regression model makes the prediction.

Example taken from the sklearn documentation:

import numpy as np
import matplotlib.pyplot as plt

from sklearn import linear_model, decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

logistic = linear_model.LogisticRegression()

pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])

digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target

n_components = [5, 10]
Cs = np.logspace(-4, 4, 3)

param_grid = dict(pca__n_components=n_components, logistic__C=Cs)
estimator = GridSearchCV(pipe, param_grid)
estimator.fit(X_digits, y_digits)

How can I perform dimensionality reduction only on a subset of my feature set using FunctionTransformer (for example, restrict PCA to the last ten columns of X_digits)?

Upvotes: 2

Views: 1388

Answers (2)

Mutlu Simsek

Reputation: 1172

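The accepted answer also restricts the features that reach the logistic regression: only the PCA output computed from the selected columns is passed on, and all other columns are dropped. If this is not the desired behavior, you can use ColumnTransformer, which applies PCA to the selected columns and passes the remaining columns through unchanged.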

https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html

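from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
# [-10:] on its own is not valid Python; slice(-10, None) selects the last ten columns
pca_transformer = ColumnTransformer([('pca', PCA(), slice(-10, None))], remainder="passthrough")
pipe = Pipeline(steps=[('pca_transformer', pca_transformer), ('logistic', logistic)])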
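
With PCA nested inside the ColumnTransformer, the grid-search parameter name from the question no longer matches: the PCA step now sits one level deeper. A minimal sketch of the full search, assuming the step names 'pca_transformer' and 'pca' used above (max_iter=200 and cv=3 are my own choices to keep the run short, not part of the question):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X_digits, y_digits = load_digits(return_X_y=True)

# PCA on the last ten columns; the other 54 columns pass through unchanged.
pca_transformer = ColumnTransformer(
    [("pca", PCA(), slice(-10, None))], remainder="passthrough"
)
pipe = Pipeline(
    steps=[
        ("pca_transformer", pca_transformer),
        # max_iter=200 is an assumption made here, not taken from the question.
        ("logistic", LogisticRegression(max_iter=200)),
    ]
)

# Nested parameters are addressed as <pipeline step>__<transformer name>__<param>.
param_grid = {
    "pca_transformer__pca__n_components": [5, 10],
    "logistic__C": np.logspace(-4, 4, 3),
}
estimator = GridSearchCV(pipe, param_grid, cv=3)  # cv=3 is also an assumption
estimator.fit(X_digits, y_digits)
best_params = estimator.best_params_
print(best_params)
```

The logistic regression then sees the PCA components followed by the 54 untouched columns, so 59 or 64 features depending on n_components.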

Upvotes: 0

A Kruger

Reputation: 2419

You can first create a function (called last_ten_columns below) that returns the last 10 columns of the input X_digits. Then wrap that function in a FunctionTransformer and use it as the first step of the pipeline, so that PCA only ever sees those ten columns.

import numpy as np
import matplotlib.pyplot as plt

from sklearn import linear_model, decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import FunctionTransformer

logistic = linear_model.LogisticRegression()

pca = decomposition.PCA()

def last_ten_columns(X):
    return X[:, -10:]

func_trans = FunctionTransformer(last_ten_columns)

pipe = Pipeline(steps=[('func_trans', func_trans), ('pca', pca), ('logistic', logistic)])

digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target

n_components = [5, 10]
Cs = np.logspace(-4, 4, 3)

param_grid = dict(pca__n_components=n_components, logistic__C=Cs)
estimator = GridSearchCV(pipe, param_grid)
estimator.fit(X_digits, y_digits)
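
To see what actually reaches the classifier with this approach, you can run only the preprocessing steps and look at the output shape. This is a small sketch, with n_components=5 picked arbitrarily from the grid above:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def last_ten_columns(X):
    # Keep only the last ten feature columns.
    return X[:, -10:]


X_digits, _ = load_digits(return_X_y=True)

# Preprocessing steps only, so the shape handed to the classifier is visible.
preprocess = Pipeline(
    steps=[
        ("func_trans", FunctionTransformer(last_ten_columns)),
        ("pca", PCA(n_components=5)),
    ]
)
X_reduced = preprocess.fit_transform(X_digits)
print(X_digits.shape, "->", X_reduced.shape)  # (1797, 64) -> (1797, 5)
```

Note that the first 54 columns are dropped entirely here, so the logistic regression is fitted on the PCA components alone.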

Upvotes: 1
