pipeline with PCA on feature subset only in scikit-learn

Question

I have a set of features that I would like to model, one of which is actually a histogram sampled at 100 different points, so this histogram feature is really 100 different features. I would like to reduce the dimensionality of my modeling problem by performing PCA on the histogram features; however, I do not want to include the other features in the PCA, in order to maintain the interpretability of my model.

Ideally I would like to form a pipeline with PCA to transform the histogram features and SVC to perform the fitting, which I would then feed to GridSearchCV to determine the SVC hyperparameters. Is it somehow possible in this setup to have PCA transform only a subset of my features (the histogram bins)? The easiest way would be to edit the PCA object to accept a feature mask, but I would certainly prefer to use existing functionality.

EDIT

After implementing @eickenberg's answer I realized that I also wanted an inverse_transform method for the new PCA class. This method recreates the initial feature set with columns in their original order. It is provided below for anyone else who is interested:

def inverse_transform(self, X):
    if self.mask is not None:
        # Inverse transform appropriate data
        inv_mask = np.arange(len(X[0])) >= sum(~self.mask)
        inv_transformed = self.pca.inverse_transform(X[:, inv_mask])

        # Place inverse transformed columns back in their original order
        inv_transformed_reorder = np.zeros([len(X), len(self.mask)])
        inv_transformed_reorder[:, self.mask] = inv_transformed
        inv_transformed_reorder[:, ~self.mask] = X[:, ~inv_mask]
        return inv_transformed_reorder
    else:
        return self.pca.inverse_transform(X)
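
A quick way to sanity-check this method is a round trip on random data. The sketch below assumes the MaskedPCA class from the accepted answer with the method above attached, a boolean NumPy mask, and made-up data shapes. Keeping as many components as there are masked columns means nothing is discarded, so the reconstruction should match the input up to rounding:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA


class MaskedPCA(TransformerMixin, BaseEstimator):
    """PCA on the columns selected by a boolean mask; other columns pass through."""

    def __init__(self, n_components=2, mask=None):
        self.n_components = n_components
        self.mask = mask

    def fit(self, X, y=None):
        mask = self.mask if self.mask is not None else slice(None)
        self.pca = PCA(n_components=self.n_components)
        self.pca.fit(X[:, mask])
        return self

    def transform(self, X):
        if self.mask is None:
            return self.pca.transform(X)
        # Untouched columns first, principal components last
        return np.hstack([X[:, ~self.mask], self.pca.transform(X[:, self.mask])])

    def inverse_transform(self, X):
        if self.mask is None:
            return self.pca.inverse_transform(X)
        n_kept = int((~self.mask).sum())  # number of pass-through columns
        restored = np.zeros((X.shape[0], self.mask.shape[0]))
        restored[:, self.mask] = self.pca.inverse_transform(X[:, n_kept:])
        restored[:, ~self.mask] = X[:, :n_kept]
        return restored


rng = np.random.RandomState(0)       # synthetic data, for illustration only
X = rng.randn(100, 20)
mask = np.arange(20) > 4             # columns 5..19 go through PCA

# All 15 components are kept, so no information is lost in the projection
mpca = MaskedPCA(n_components=15, mask=mask).fit(X)
X_back = mpca.inverse_transform(mpca.transform(X))
roundtrip_ok = np.allclose(X, X_back)
```

With fewer components the pass-through columns should still be recovered exactly, while the masked columns come back only as their low-rank approximation.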

eickenberg · Accepted Answer

This is not possible straight out of the box with scikit-learn. To be able to exploit the full functionality of Pipeline and GridSearchCV, consider creating an object MaskedPCA that inherits from sklearn.base.BaseEstimator and exposes the methods fit and transform. Inside it, use a PCA object on your masked features. The mask should be passed to the constructor.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA

class MaskedPCA(TransformerMixin, BaseEstimator):

    def __init__(self, n_components=2, mask=None):
        # mask marks the columns to send through PCA; assumed to be a boolean NumPy array
        self.n_components = n_components
        self.mask = mask

    def fit(self, X, y=None):
        # y is accepted (and ignored) so that Pipeline and fit_transform can call fit(X, y)
        self.pca = PCA(n_components=self.n_components)
        mask = self.mask if self.mask is not None else slice(None)
        self.pca.fit(X[:, mask])
        return self

    def transform(self, X):
        mask = self.mask if self.mask is not None else slice(None)
        pca_transformed = self.pca.transform(X[:, mask])
        if self.mask is not None:
            # Unmasked columns come first, followed by the principal components
            remaining_cols = X[:, ~mask]
            return np.hstack([remaining_cols, pca_transformed])
        else:
            return pca_transformed

You can test it on some generated data:

import numpy as np
X = np.random.randn(100, 20)
mask = np.arange(20) > 4

mpca = MaskedPCA(n_components=2, mask=mask)

transformed = mpca.fit(X).transform(X)

# check whether first five columns are equal
from numpy.testing import assert_array_equal
assert_array_equal(X[:, :5], transformed[:, :5])

Observe that transformed now has (~mask).sum() + mpca.n_components == 7 columns.
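
To connect this back to the setup described in the question, here is a minimal sketch of the class inside a Pipeline with SVC, tuned by GridSearchCV. The step names ("masked_pca", "svc"), the parameter grid values, and the synthetic data are all assumptions made for illustration. Note that fit takes an optional y, because Pipeline passes the targets to every step:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


class MaskedPCA(TransformerMixin, BaseEstimator):
    """PCA on the columns selected by a boolean mask; other columns pass through."""

    def __init__(self, n_components=2, mask=None):
        self.n_components = n_components
        self.mask = mask

    def fit(self, X, y=None):
        mask = self.mask if self.mask is not None else slice(None)
        self.pca = PCA(n_components=self.n_components)
        self.pca.fit(X[:, mask])
        return self

    def transform(self, X):
        if self.mask is None:
            return self.pca.transform(X)
        # Untouched columns first, principal components last
        return np.hstack([X[:, ~self.mask], self.pca.transform(X[:, self.mask])])


# Synthetic data: 5 ordinary features followed by 15 "histogram bin" features
rng = np.random.RandomState(0)
X = rng.randn(120, 20)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
mask = np.arange(20) > 4

pipe = Pipeline([
    ("masked_pca", MaskedPCA(mask=mask)),
    ("svc", SVC()),
])

# Parameters of a step are addressed as <step name>__<parameter name>
param_grid = {
    "masked_pca__n_components": [2, 5],
    "svc__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)

best_pca = search.best_estimator_.named_steps["masked_pca"]
n_features_seen_by_svc = best_pca.transform(X).shape[1]
```

Because n_components is a constructor argument stored under the same name, GridSearchCV can clone the estimator and search over the number of components together with the SVC hyperparameters.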
