I have a dataframe with 100 features that I am using for a clustering problem. The features are divided into 3 blocks, N1, N2 and N3, and every feature carries its group as a suffix. For example, a feature name might be:
umidity_n1, air_n1, lat_n2, long_n2, etc.
Right now my pipeline applies PCA to the whole dataset, but I would like the PCA to be applied per group: one PCA for the features with the _n1 suffix, one for the features with the _n2 suffix, and another for the features with the _n3 suffix.
My pipeline currently looks like this:
## Pipeline
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

prepData = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=20, random_state=42)),
    ]
)
kModel = Pipeline(
    [
        (
            "kmeans",
            KMeans(
                n_clusters=6,
                init="k-means++",
                n_init=20,
                max_iter=100,
                random_state=42,
            ),
        ),
    ]
)
pipe = Pipeline(
    [
        ("prepData", prepData),
        ("kModel", kModel),
    ]
)
Any ideas how to split the PCA procedure by blocks of variables inside the above pipeline?
You can use ColumnTransformer to transform the column groups separately, each with its own PCA. As the help page for ColumnTransformer explains, you pass the indices of the columns you want each PCA to transform; below I use the following (to get the columns with the _n1 suffix):
np.where(df.columns.str.contains('_n1'))[0]
Some example data:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
df = pd.DataFrame(np.random.uniform(0, 1, (100, 6)),
                  columns=['umidity_n1', 'air_n1', 'a_n1', 'lat_n2', 'long_n2', 'b_n2'])
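To check which positions those selectors pick out on this frame, you can inspect them directly:
np.where(df.columns.str.contains('_n1'))[0]  # array([0, 1, 2])
np.where(df.columns.str.contains('_n2'))[0]  # array([3, 4, 5])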
Set up the column transformer and pipeline:
pca = PCA(n_components=2)
pca_by_column = ColumnTransformer(
    transformers=[
        ('pca_n1', pca, np.where(df.columns.str.contains('_n1'))[0]),
        ('pca_n2', pca, np.where(df.columns.str.contains('_n2'))[0]),
    ],
    remainder='passthrough')
prepData = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ('pca', pca_by_column)
])
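Passing the same pca object to both entries is fine: ColumnTransformer clones each transformer when it fits, so each group ends up with its own fitted PCA. As a minimal sketch of how this slots back into your original pipeline (reusing the kModel step from your question; the shape comment assumes the 6-column example frame above):
from sklearn.cluster import KMeans

kModel = Pipeline([
    ("kmeans", KMeans(n_clusters=6, init="k-means++", n_init=20,
                      max_iter=100, random_state=42)),
])
pipe = Pipeline([
    ("prepData", prepData),
    ("kModel", kModel),
])
pipe.fit(df)
# Each block is reduced to 2 components, so (100, 6) -> (100, 4)
prepData.transform(df).shape  # (100, 4)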