Reputation: 57
I'm trying to create a pipeline that combines transformations on all features, on numerical features, and on categorical features in a sklearn.compose.ColumnTransformer.
Here is a piece of code showing what I'm trying to do:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# ColumnNameNormalizer, ColumnIncorrectDataCleaner, CustomNumImputer and
# CustomCatImputer are my own custom transformers (definitions not shown).
alltypes = Pipeline([
    ('column_name_normalizer', ColumnNameNormalizer()),
    ('column_incorrect_data_cleaner', ColumnIncorrectDataCleaner(some_parameter)),
])
num_pipeline = Pipeline([
    ('imputer', CustomNumImputer(some_parameter)),  # fill in missing values
])
cat_pipeline = Pipeline([
    ("cat", CustomCatImputer(some_parameter)),
])
full_pipeline = ColumnTransformer([
    ("alltypes", alltypes, allcolumns),
    ("num", num_pipeline, numfeat),
    ("cat", cat_pipeline, catfeat),
])
try:
    # Densify the result if the transformer returned a sparse matrix
    X = pd.DataFrame(full_pipeline.fit_transform(X).toarray())
except AttributeError:
    # The result is already dense and has no .toarray() method
    X = pd.DataFrame(full_pipeline.fit_transform(X))
However, in the end I get a DataFrame with more features than at the beginning, because the outputs of all the pipelines are concatenated instead of a union being performed on them.
For instance, I want to apply some transformations to all features, then some to the categorical features and some to the numerical features, but I want the output DataFrame to always be the same size.
Do you know how I can fix this?
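A minimal sketch that reproduces the column growth, using stock scikit-learn transformers as stand-ins for the custom ones. The toy data, the column names, and the choice of FunctionTransformer and SimpleImputer are assumptions made only for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer

# Toy data (hypothetical): two numeric columns and one categorical column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0],
    "income": [3000.0, 4200.0, np.nan],
    "city": ["Paris", np.nan, "Lyon"],
})
all_cols = ["age", "income", "city"]
num_cols = ["age", "income"]
cat_cols = ["city"]

# Each entry of a ColumnTransformer produces its own output block,
# and the blocks are stacked side by side.
overlapping = ColumnTransformer([
    ("alltypes", FunctionTransformer(), all_cols),        # identity stand-in
    ("num", SimpleImputer(strategy="median"), num_cols),
    ("cat", SimpleImputer(strategy="most_frequent"), cat_cols),
])

out = overlapping.fit_transform(df)
print(df.shape, out.shape)  # 3 input columns become 3 + 2 + 1 = 6 output columns
```

Because the column lists overlap, every column appears twice in the output: once from the "alltypes" block and once from the "num" or "cat" block.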
Upvotes: 1
Views: 610
Reputation: 12602
You need to combine the sequential power of Pipeline with the column-wise splitting of ColumnTransformer, e.g.
cat_num_split = ColumnTransformer([
    ("num", num_pipeline, numfeat),
    ("cat", cat_pipeline, catfeat),
])
full_pipeline = Pipeline([
    ("alltypes", alltypes),
    ("cat_num", cat_num_split),
])
There is a catch here: the alltypes transformer will result in a numpy array without information about which columns are which; your cat_num_split feature lists numfeat and catfeat will rely on your knowledge of which columns are which, and cannot use the column names.
An alternative that doesn't run into the feature-name issue is to switch the order here.
num_full_pipe = Pipeline([
    ("common", alltypes),
    ("num", num_pipeline),
])
cat_full_pipe = Pipeline([
    ("common", alltypes),
    ("cat", cat_pipeline),
])
full_pipeline = ColumnTransformer([
    ("num", num_full_pipe, numfeat),
    ("cat", cat_full_pipe, catfeat),
])
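A runnable sketch of this second layout, again with assumed toy data and stock transformers as stand-ins (an identity FunctionTransformer for alltypes, SimpleImputer for the custom imputers). Since the ColumnTransformer receives the original DataFrame, column names can be used directly:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0],
    "income": [3000.0, 4200.0, np.nan],
    "city": ["Paris", np.nan, "Lyon"],
})
numfeat = ["age", "income"]
catfeat = ["city"]

common = FunctionTransformer()  # identity stand-in for `alltypes`

num_full_pipe = Pipeline([
    ("common", common),
    ("num", SimpleImputer(strategy="median")),
])
cat_full_pipe = Pipeline([
    ("common", common),
    ("cat", SimpleImputer(strategy="most_frequent")),
])
full_pipeline = ColumnTransformer([
    ("num", num_full_pipe, numfeat),
    ("cat", cat_full_pipe, catfeat),
])

out = full_pipeline.fit_transform(df)
# Output columns follow the order of the ColumnTransformer entries.
result = pd.DataFrame(out, columns=numfeat + catfeat)
print(result.shape)
```

Note that each branch fits its own clone of the common step on only its subset of columns, so this layout suits common steps that act column by column. The output columns come out grouped as numerical then categorical, which may differ from the original order.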
See also Consistent ColumnTransformer for intersecting lists of columns.
Upvotes: 1