Reputation: 110004
I have a data set and want to apply scaling and then PCA to a subset of a pandas dataframe and return just the components and the columns not being transformed. So using the mpg
data set from seaborn I can see the training set trying to predict mpg looks like this:
Now let's say I want to leave cylinders and discplacement alone and scale everything else and reduce it to 2 components. I'd expect the result to be 4 total columns, the original 2 plus the 2 components.
How can I use ColumnTransformer
to do the scaling to a subset of columns, then the PCA and return only the components and the 2 passthrough columns?
MWE
import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import (StandardScaler)
from sklearn.decomposition import PCA
from sklearn.compose import ColumnTransformer
df = sns.load_dataset('mpg').drop(["origin", "name"], axis = 1).dropna()
X = df.loc[:, ~df.columns.isin(['mpg'])]
y = df.iloc[:,0].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 21)
scaler = StandardScaler()
pca = PCA(n_components = 2)
dtm_i = list(range(2, len(X_train.columns)))
preprocess = ColumnTransformer(transformers=[('scaler', scaler, dtm_i), ('PCA DTM', pca, dtm_i)], remainder='passthrough')
trans = preprocess.fit_transform(X_train)
pd.DataFrame(trans)
I strongly suspect my misconception of how this step works is wrong: preprocess = ColumnTransformer(transformers=[('scaler', scaler, dtm_i), ('PCA DTM', pca, dtm_i)]
. I think it operates on the last 4 columns, first doing a scale and then PCA and final returns the 2 components but I get 8 columns, the first 4 are scale, the next 2 appear to be the components (likely they weren't scale first), and lastly, the two columns I 'passthrough'
.
Upvotes: 2
Views: 2037
Reputation: 110004
I think this works but don't know if this is the way Python/scikit way to solve it:
import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import (StandardScaler)
from sklearn.decomposition import PCA
from sklearn.compose import ColumnTransformer
df = sns.load_dataset('mpg').drop(["origin", "name"], axis = 1).dropna()
X = df.loc[:, ~df.columns.isin(['mpg'])]
y = df.iloc[:,0].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 21)
scaler = StandardScaler()
pca = PCA(n_components = 2)
dtm_i = list(range(2, len(X_train.columns)))
dtm_i2 = list(range(0, len(X_train.columns)-2))
preprocess = ColumnTransformer(transformers=[('scaler', scaler, dtm_i)], remainder='passthrough')
preprocess2 = ColumnTransformer(transformers=[('PCA DTM', pca, dtm_i2)], remainder='passthrough')
trans = preprocess.fit_transform(X_train)
trans = preprocess2.fit_transform(trans)
pd.DataFrame(trans)
Upvotes: 2