How to combine pipelines for all features, for categorical features, and for numerical features in one ColumnTransformer?

I'm trying to create a pipeline that combines:

  1. Pipeline for all kinds of features, no matter the type (cleaning incorrect data by feature)
  2. Pipeline for categorical features (categorical imputer)
  3. Pipeline for numerical features (numerical imputer)

in a sklearn.compose.ColumnTransformer.

Here is a piece of code showing what I'm trying to do:

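import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# ColumnNameNormalizer, ColumnIncorrectDataCleaner, CustomNumImputer and
# CustomCatImputer are custom transformers (definitions not shown)
alltypes = Pipeline([
    ('column_name_normalizer', ColumnNameNormalizer()),
    ('column_incorrect_data_cleaner', ColumnIncorrectDataCleaner(some_parameter)),
])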

num_pipeline = Pipeline([
    ('imputer', CustomNumImputer(some_parameter)),  # fill in missing values
])

cat_pipeline = Pipeline([
    ("cat", CustomCatImputer(some_parameter))
])

full_pipeline = ColumnTransformer([
    ("alltypes", alltypes, allcolumns),
    ("num", num_pipeline, numfeat),
    ("cat", cat_pipeline, catfeat),
])

# Fit once, then densify only if the result is a sparse matrix
Xt = full_pipeline.fit_transform(X)
if hasattr(Xt, "toarray"):
    Xt = Xt.toarray()
X = pd.DataFrame(Xt)

However, I end up with a dataframe that has more features than I started with, because the outputs of all the pipelines are concatenated side by side instead of being merged (a UNION over the columns).

For instance, I want to apply some transformations to all features, then some to the categorical features and some to the numerical features, but I want the output dataframe to always have the same number of columns as the input.
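
The effect can be reproduced in a few lines. This is only a minimal sketch: the toy dataframe `X_demo` is made up, and "passthrough" stands in for the custom pipelines, so none of the names below come from the real code.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer

# Toy data (hypothetical): two numerical columns and one categorical column
X_demo = pd.DataFrame({
    "age": [25.0, 31.0, 40.0],
    "income": [50000.0, 62000.0, 58000.0],
    "city": ["Paris", "Lyon", "Paris"],
})
allcolumns = list(X_demo.columns)
numfeat = ["age", "income"]
catfeat = ["city"]

# "passthrough" stands in for the custom pipelines; it returns its columns unchanged
demo = ColumnTransformer([
    ("alltypes", "passthrough", allcolumns),
    ("num", "passthrough", numfeat),
    ("cat", "passthrough", catfeat),
])
out = demo.fit_transform(X_demo)

# Each branch contributes its own block of columns: 3 + 2 + 1 = 6
print(X_demo.shape, out.shape)
```

Every column listed in more than one branch shows up once per branch in the output.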

Do you know how I can fix this?

Upvotes: 1

Views: 610

Answers (1)

Ben Reiniger

Reputation: 12602

You need to combine the sequential power of Pipeline with the parallel power of ColumnTransformer, e.g.

cat_num_split = ColumnTransformer([
    ("num", num_pipeline, numfeat),
    ("cat", cat_pipeline, catfeat),
])
full_pipeline = Pipeline([
    ("alltypes", alltypes),
    ("cat_num", cat_num_split),
])

There is a catch here: the alltypes transformer will typically output a numpy array, which carries no information about which columns are which. The numfeat and catfeat lists used by cat_num_split therefore have to be column positions, based on your own knowledge of the column order, and cannot be column names.
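
A rough sketch of this first approach, under stated assumptions: the toy dataframe is made up, `to_array` is a hypothetical stand-in for alltypes that returns a bare array, and SimpleImputer replaces the two custom imputers. Because the common step drops the column names, the split refers to columns by position.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Toy data (hypothetical) with one missing value per column
X_demo = pd.DataFrame({
    "age": [25.0, np.nan, 40.0],
    "income": [50000.0, 62000.0, np.nan],
    "city": ["Paris", np.nan, "Paris"],
})

def to_array(df):
    # Stand-in for alltypes: returns a NumPy array, so column names are lost
    return np.asarray(df, dtype=object)

# Columns are selected by position because the input is now a plain array
cat_num_split = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), [0, 1]),
    ("cat", SimpleImputer(strategy="most_frequent"), [2]),
])
full_pipeline = Pipeline([
    ("alltypes", FunctionTransformer(to_array)),
    ("cat_num", cat_num_split),
])

out = full_pipeline.fit_transform(X_demo)
print(out.shape)  # same number of columns as X_demo
```

The output columns follow the order of the ColumnTransformer branches (numerical first, then categorical), not necessarily the original column order.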

An alternative that doesn't run into the feature name issue is to switch the order: split the columns first, and run the common steps inside each branch.

num_full_pipe = Pipeline([
    ("common", alltypes),
    ("num", num_pipeline),
])
cat_full_pipe = Pipeline([
    ("common", alltypes),
    ("cat", cat_pipeline),
])
full_pipeline = ColumnTransformer([
    ("num", num_full_pipe, numfeat),
    ("cat", cat_full_pipe, catfeat),
])
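
Here is a runnable sketch of that second layout. It rests on assumptions: the data is made up, `mark_incorrect_as_missing` is a hypothetical cleaning rule standing in for alltypes (it treats -999.0 and "?" as bad values), and SimpleImputer stands in for the custom imputers.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

numfeat = ["age", "income"]
catfeat = ["city"]
# Toy data (hypothetical): -999.0 and "?" mark incorrect entries
X_demo = pd.DataFrame({
    "age": [25.0, -999.0, 40.0],
    "income": [50000.0, 62000.0, np.nan],
    "city": ["Paris", "?", "Paris"],
})

def mark_incorrect_as_missing(df):
    # Stand-in for alltypes: replace placeholder values with NaN
    return df.mask(df.isin([-999.0, "?"]))

# ColumnTransformer fits a clone of each branch, so one instance can be reused
common = FunctionTransformer(mark_incorrect_as_missing)

num_full_pipe = Pipeline([
    ("common", common),
    ("num", SimpleImputer(strategy="mean")),
])
cat_full_pipe = Pipeline([
    ("common", common),
    ("cat", SimpleImputer(strategy="most_frequent")),
])
full_pipeline = ColumnTransformer([
    ("num", num_full_pipe, numfeat),
    ("cat", cat_full_pipe, catfeat),
])

out = full_pipeline.fit_transform(X_demo)
result = pd.DataFrame(out, columns=numfeat + catfeat)
print(result)
```

The trade-off is that the common steps are fitted once per branch and only see that branch's columns, so this layout does not suit a cleaning step that needs all columns at once.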

See also Consistent ColumnTransformer for intersecting lists of columns.

Upvotes: 1
