Gustavomoty

Reputation: 87

Apply PCA per group of features in a scikit-learn Pipeline instead of to all features

I have a dataframe with 100 features that I am using for a clustering problem. The features are divided into 3 blocks, N1, N2 and N3, and each feature carries its group as a suffix. For example, feature names might be:

umidity_n1, air_n1, lat_n2, long_n2, etc.

Right now my pipeline applies PCA to the whole data, but I would like PCA applied per group: one PCA for features with the _n1 suffix, one for features with the _n2 suffix, and another for features with the _n3 suffix.

My pipeline is working as:

## Pipeline
prepData = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=20, random_state=42)),
    ]
)

kModel = Pipeline(
    [
        (
            "kmeans",
            KMeans(
                n_clusters=6,
                init="k-means++",
                n_init=20,
                max_iter=100,
                random_state=42,
            ),
        ),
    ]
)

pipe = Pipeline(
    [
        ("prepData", prepData),
        ("kModel", kModel)
    ]
)

Any ideas how to split the PCA procedure by blocks of variables inside the above pipeline?

Upvotes: 0

Views: 983

Answers (1)

StupidWolf

Reputation: 46968

You can use ColumnTransformer to transform each group of columns with a separate PCA. As described in the help page for ColumnTransformer, you pass the indices of the columns each PCA should transform; below I use the following to get the columns with the _n1 suffix:

np.where(df.columns.str.contains('_n1'))[0]
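As a quick check of what that expression returns (a toy frame just to illustrate; the column names are placeholders matching the question):

```python
import numpy as np
import pandas as pd

# Empty frame: only the column labels matter for index selection
df = pd.DataFrame(columns=['umidity_n1', 'air_n1', 'lat_n2', 'long_n2'])

# Boolean mask over column names -> positional indices of the _n1 columns
idx_n1 = np.where(df.columns.str.contains('_n1'))[0]
print(idx_n1)  # [0 1]
```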

An example data:

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

df = pd.DataFrame(np.random.uniform(0, 1, (100, 6)),
                  columns=['umidity_n1', 'air_n1', 'a_n1',
                           'lat_n2', 'long_n2', 'b_n2'])

Set up the column transformer and pipeline:

pca = PCA(n_components=2)

pca_by_column = ColumnTransformer(transformers=[
    ('pca_n1', pca, np.where(df.columns.str.contains('_n1'))[0]),
    ('pca_n2', pca, np.where(df.columns.str.contains('_n2'))[0])
    ],
    remainder='passthrough')

prepData = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ('pca', pca_by_column)
])
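To verify the transformer splits as intended, you can fit it on the toy frame and check the output shape: two components per block, with no columns left over for the passthrough (a self-contained sketch of the setup above):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data mirroring the example: three _n1 columns and three _n2 columns
df = pd.DataFrame(np.random.uniform(0, 1, (100, 6)),
                  columns=['umidity_n1', 'air_n1', 'a_n1',
                           'lat_n2', 'long_n2', 'b_n2'])

# One PCA per suffix group, selected by column position
pca_by_column = ColumnTransformer(transformers=[
    ('pca_n1', PCA(n_components=2), np.where(df.columns.str.contains('_n1'))[0]),
    ('pca_n2', PCA(n_components=2), np.where(df.columns.str.contains('_n2'))[0]),
], remainder='passthrough')

prepData = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('pca', pca_by_column),
])

out = prepData.fit_transform(df)
print(out.shape)  # (100, 4): 2 components from each block, nothing passed through
```

This `prepData` can then replace the prep step in the question's pipeline, with the KMeans step unchanged.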

Upvotes: 2
