Apply scaling and PCA to a subset of columns in ColumnTransformer

Question

I have a data set and want to apply scaling and then PCA to a subset of a pandas dataframe and return just the components and the columns not being transformed. So using the mpg data set from seaborn I can see the training set trying to predict mpg looks like this:

Now let's say I want to leave cylinders and displacement alone, scale everything else, and reduce it to 2 components. I'd expect the result to have 4 columns in total: the original 2 plus the 2 components.

How can I use ColumnTransformer to do the scaling to a subset of columns, then the PCA and return only the components and the 2 passthrough columns?

MWE

import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import (StandardScaler)
from sklearn.decomposition import PCA
from sklearn.compose import ColumnTransformer

df = sns.load_dataset('mpg').drop(["origin", "name"], axis = 1).dropna()

X = df.loc[:, ~df.columns.isin(['mpg'])]
y = df.iloc[:,0].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 21) 


scaler = StandardScaler()
pca = PCA(n_components = 2)
dtm_i = list(range(2, len(X_train.columns)))

preprocess = ColumnTransformer(transformers=[('scaler', scaler, dtm_i), ('PCA DTM', pca, dtm_i)], remainder='passthrough')
trans = preprocess.fit_transform(X_train)

pd.DataFrame(trans)

I strongly suspect I misunderstand how this step works: preprocess = ColumnTransformer(transformers=[('scaler', scaler, dtm_i), ('PCA DTM', pca, dtm_i)]. I thought it operated on the last 4 columns, first scaling them, then applying PCA, and finally returning the 2 components. Instead I get 8 columns: the first 4 are the scaled columns, the next 2 appear to be the components (likely computed without scaling first), and the last 2 are the columns I passed through.
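A small check is consistent with that reading. This sketch uses synthetic data as a stand-in for X_train (the name X_demo and the random values are assumptions, not the mpg data) and counts the output columns when both transformers are listed side by side. ColumnTransformer hands each transformer the original columns independently and concatenates the results, so the transformers are not chained.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train (assumption: 6 numeric columns, named as in the mpg example)
rng = np.random.default_rng(0)
cols = ["cylinders", "displacement", "horsepower", "weight", "acceleration", "model_year"]
X_demo = pd.DataFrame(rng.normal(size=(50, 6)), columns=cols)

dtm_i = list(range(2, X_demo.shape[1]))  # positions of the last 4 columns

# Both transformers receive the same raw columns; their outputs are stacked side by side
parallel = ColumnTransformer(
    transformers=[
        ("scaler", StandardScaler(), dtm_i),
        ("pca", PCA(n_components=2), dtm_i),
    ],
    remainder="passthrough",
)
out = parallel.fit_transform(X_demo)

# 4 scaled columns + 2 components (from unscaled input) + 2 passthrough columns
n_out_cols = out.shape[1]
print(n_out_cols)
```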

Tyler Rinker · Accepted Answer

I think this works, but I don't know if it is the idiomatic Python/scikit-learn way to solve it:

import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import (StandardScaler)
from sklearn.decomposition import PCA
from sklearn.compose import ColumnTransformer

df = sns.load_dataset('mpg').drop(["origin", "name"], axis = 1).dropna()

X = df.loc[:, ~df.columns.isin(['mpg'])]
y = df.iloc[:,0].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 21) 


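scaler = StandardScaler()
pca = PCA(n_components = 2)
# Positions of the last 4 columns in X_train (the ones to scale)
dtm_i = list(range(2, len(X_train.columns)))
# After the first step the scaled columns come first, so PCA targets positions 0-3
dtm_i2 = list(range(0, len(X_train.columns)-2))

# Step 1: scale the subset; the 2 untouched columns are appended at the end
preprocess = ColumnTransformer(transformers=[('scaler', scaler, dtm_i)], remainder='passthrough')
# Step 2: run PCA on the scaled columns and pass the other 2 through again
preprocess2 = ColumnTransformer(transformers=[('PCA DTM', pca, dtm_i2)], remainder='passthrough')
trans = preprocess.fit_transform(X_train)
trans = preprocess2.fit_transform(trans)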

pd.DataFrame(trans)
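A possibly more compact alternative is to chain the scaler and PCA in a Pipeline and hand that pipeline to a single ColumnTransformer. The sketch below is not from the original answer; it runs on synthetic data standing in for X_train, and the names X_demo, to_reduce, scale_then_pca, pc1 and pc2 are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train (assumption: same 6 column names as the mpg example)
rng = np.random.default_rng(0)
cols = ["cylinders", "displacement", "horsepower", "weight", "acceleration", "model_year"]
X_demo = pd.DataFrame(rng.normal(size=(50, 6)), columns=cols)

# Columns to scale and reduce; everything else is passed through unchanged
to_reduce = ["horsepower", "weight", "acceleration", "model_year"]

# Pipeline steps run in sequence: PCA sees the scaled columns
scale_then_pca = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
])

preprocess = ColumnTransformer(
    transformers=[("scale_pca", scale_then_pca, to_reduce)],
    remainder="passthrough",
)
trans = preprocess.fit_transform(X_demo)

# Transformer output comes first, then the passthrough columns in their original order
result = pd.DataFrame(
    trans,
    columns=["pc1", "pc2", "cylinders", "displacement"],
    index=X_demo.index,
)
print(result.shape)
```

With this layout a single fitted object holds both steps, so preprocess.transform(X_test) would reuse the scaling and components learned from the training data rather than refitting them.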
