SoraHeart

Reputation: 420

Why is SimpleImputer in ColumnTransformer creating additional Columns?

I am following the machine learning book by Aurélien Géron.

I am experimenting with the ColumnTransformer class. When I include SimpleImputer, an additional column is created. I understand that SimpleImputer fills in missing values in the total_bedrooms column (column index 4 in the result), so I don't see why it adds a new column (column index 10) to the result.

When I leave SimpleImputer out of the ColumnTransformer and instead create a separate instance and call fit_transform on the output of the ColumnTransformer, I do not get the additional column. Please advise.

category_att = X.select_dtypes(include='object').columns
num_att = X.select_dtypes(include='number').columns

transformer = ColumnTransformer(
    [
    ('adder', AttributeAdder(), num_att ),
    ('imputer', SimpleImputer(strategy='median'), ['total_bedrooms']),
    ('ohe', OneHotEncoder(), category_att)
    ],
    remainder = 'passthrough'
)

Custom class for adding two new features/columns:

class AttributeAdder(BaseEstimator, TransformerMixin):
    
    def __init__(self, add_bed_room = False):
        self.add_bed_room = add_bed_room
    
    def fit(self, X, y=None):
        return self
    
    def transform(self,X,y=None):
        
        room_per_household = X.iloc[: , t_room ] / X.iloc[: , t_household ]
        population_per_household = X.iloc[: , t_population ] / X.iloc[: , t_household ]
        return np.c_[X,room_per_household,population_per_household]

Results: [image]

Upvotes: 3

Views: 2058

Answers (1)

Ben Reiniger

Reputation: 12602

Why

It's not the SimpleImputer exactly; it's the ColumnTransformer itself. ColumnTransformer applies its transformers in parallel, not sequentially (see also [1], [2]), so if a column gets passed to multiple transformers, you'll end up with that column multiple times in the output. In your case, output column 4 comes from the "adder" on total_bedrooms (which has done nothing, so has missing values still), and output column 10 comes from the "imputer" (and so will have no missing values).
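
To see the parallel behavior in isolation, here is a minimal sketch on toy data (the column names "a" and "b" and the values are made up for illustration, not taken from the housing dataset). Column "b" is sent to two transformers, so it shows up twice in the output: once raw, once imputed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values): "b" has one missing entry.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, np.nan, 30.0]})

ct = ColumnTransformer(
    [
        # Receives the raw "a" and "b" and forwards them unchanged.
        ("keep", "passthrough", ["a", "b"]),
        # Also receives the raw "b"; it never sees the output of "keep".
        ("imputer", SimpleImputer(strategy="median"), ["b"]),
    ]
)
out = ct.fit_transform(df)

# Three output columns: a, b (still with NaN), b (imputed).
print(out.shape)
print(out)
```

The outputs are simply stacked side by side in the order the transformers are listed, which is why the imputed copy lands at the end.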

Fixes

In this particular case, two approaches seem the easiest.

Impute everything

Any of your numeric features that don't have missing values won't be affected. However, if you want the pipeline to raise an error on future data that has missing values, then don't do this.

num_pipe = Pipeline([
    ("add_feat", AttributeAdder()),
    ("impute", SimpleImputer(strategy="median")),
])
transformer = ColumnTransformer(
    [
        ('num', num_pipe, num_att),
        ('cat', OneHotEncoder(), category_att),
    ],
    remainder = 'passthrough',
)
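
As a quick check of the claim that complete columns are left alone, here is a small sketch on a made-up array (the values are illustrative only): the first column has no missing values and comes back unchanged, while the gap in the second column is filled with that column's median.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: column 0 is complete, column 1 has one missing value.
X_toy = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 30.0],
])

imputed = SimpleImputer(strategy="median").fit_transform(X_toy)

# Column 0 is returned as-is; the NaN in column 1 becomes the median of 10 and 30.
print(imputed)
```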

Smaller transformer column sets

Since you don't actually need your total_bedrooms column for your AttributeAdder, you don't need to pass it into that transformer. The specifics of this will depend on how you're using t_room, t_household, etc. (the positional indices will shift once fewer columns are passed in), but generally:

transformer = ColumnTransformer(
    [
        ('adder', AttributeAdder(), ["total_rooms", "households", "population"]),
        ('imputer', SimpleImputer(strategy='median'), ['total_bedrooms']),
        ('ohe', OneHotEncoder(), category_att)
    ],
    remainder = 'passthrough'  # now you're relying on this one much more
)

In a related approach, you have more flexibility in how the added features are computed. Change your AttributeAdder to return just the new features (don't concatenate to X in the last step of transform), and rely on the ColumnTransformer to pass those features along. (Note that we can't rely on remainder for those, but we can use "passthrough" as one of the transformers.)

class AttributeAdder(BaseEstimator, TransformerMixin):
    ...
    def transform(self,X,y=None):
        ...
        return np.c_[room_per_household,population_per_household]


transformer = ColumnTransformer(
    [
        ('adder', AttributeAdder(), num_att),
        ('num', "passthrough", num_att.drop(['total_bedrooms'])),
        ('imputer', SimpleImputer(strategy='median'), ['total_bedrooms']),
        ('ohe', OneHotEncoder(), category_att)
    ],
    remainder='passthrough',  # if you have columns in neither of num_att and category_att that you want kept
)
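
For reference, here is a self-contained sketch of this last approach on a tiny made-up frame. RatioAdder is a hypothetical stand-in for your AttributeAdder that selects columns by name instead of by the t_room / t_household indices, and the data values are invented; the categorical column and the OneHotEncoder are left out to keep it short.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer


class RatioAdder(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for AttributeAdder: returns only the new features."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # X arrives as a DataFrame because the columns are selected by name.
        rooms_per_household = (X["total_rooms"] / X["households"]).to_numpy()
        population_per_household = (X["population"] / X["households"]).to_numpy()
        # Do not concatenate X here; the ColumnTransformer passes it along.
        return np.c_[rooms_per_household, population_per_household]


# Made-up numeric data with one missing total_bedrooms value.
X = pd.DataFrame({
    "total_rooms": [10.0, 20.0, 30.0],
    "total_bedrooms": [2.0, np.nan, 6.0],
    "population": [4.0, 8.0, 12.0],
    "households": [2.0, 4.0, 3.0],
})
num_att = X.columns

transformer = ColumnTransformer(
    [
        ("adder", RatioAdder(), ["total_rooms", "population", "households"]),
        ("num", "passthrough", num_att.drop(["total_bedrooms"])),
        ("imputer", SimpleImputer(strategy="median"), ["total_bedrooms"]),
    ]
)
out = transformer.fit_transform(X)

# 2 ratio columns + 3 passed-through columns + 1 imputed column.
print(out.shape)
```

Each input column now reaches the output exactly once (plus the two ratios), so total_bedrooms appears only in its imputed form.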

Upvotes: 5
