Reputation: 420
I am following Aurélien Géron's machine learning book and experimenting with the ColumnTransformer class. When I include SimpleImputer, an additional column is created. I understand that SimpleImputer fills in missing values in the total_bedrooms column (column index 4 in the result), so I don't see why it adds a new column (column index 10) to the result.
When I leave SimpleImputer out of the ColumnTransformer, and instead create a separate instance and call fit_transform on the output of the ColumnTransformer, I do not get the additional column. Please advise.
category_att = X.select_dtypes(include='object').columns
num_att = X.select_dtypes(include='number').columns
transformer = ColumnTransformer(
    [
        ('adder', AttributeAdder(), num_att),
        ('imputer', SimpleImputer(strategy='median'), ['total_bedrooms']),
        ('ohe', OneHotEncoder(), category_att)
    ],
    remainder='passthrough'
)
Custom class for adding two new features/columns:
class AttributeAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bed_room=False):
        self.add_bed_room = add_bed_room

    def fit(self, X, y=None):  # scikit-learn calls fit(X, y), so X must be a parameter
        return self

    def transform(self, X, y=None):
        # t_room, t_household, t_population are column positions defined elsewhere (not shown)
        room_per_household = X.iloc[:, t_room] / X.iloc[:, t_household]
        population_per_household = X.iloc[:, t_population] / X.iloc[:, t_household]
        # Append the two ratio columns to the right of the original columns
        return np.c_[X, room_per_household, population_per_household]
Upvotes: 3
Views: 2058
Reputation: 12602
It's not the SimpleImputer exactly; it's the ColumnTransformer itself. ColumnTransformer applies its transformers in parallel, not sequentially (see also [1], [2]), so if a column gets passed to multiple transformers, you'll end up with that column multiple times in the output. In your case, output column 4 comes from the "adder" on total_bedrooms (which has done nothing, so has missing values still), and output column 10 comes from the "imputer" (and so will have no missing values).
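The duplication can be reproduced on a toy frame. This is only a minimal sketch: the two-column DataFrame and the transformer name "keep" are made up for illustration, and a "passthrough" entry stands in for the adder, since both hand the column back untouched.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy data (hypothetical): column "a" has one missing value.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [10.0, 20.0, 30.0]})

ct = ColumnTransformer([
    # Stand-in for the adder: returns "a" and "b" unchanged.
    ("keep", "passthrough", ["a", "b"]),
    # Receives its own copy of "a" and imputes it independently.
    ("imputer", SimpleImputer(strategy="median"), ["a"]),
])

out = ct.fit_transform(df)

# Two input columns, three output columns: "a" appears twice.
print(out.shape)
# Column 0 is the untouched "a" (still NaN), column 2 is the imputed copy.
print(out[:, 0], out[:, 2])
```

Each transformer only ever sees the original input, never another transformer's output, which is why the first copy of "a" keeps its missing value.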
In this particular case, two approaches seem the easiest.
The first is to impute every numeric column, by chaining the adder and the imputer in a Pipeline. Any of your numeric features that don't have missing values won't be affected. However, if you want the pipeline to error out on future data that has missing values, then don't do this.
num_pipe = Pipeline([
    ("add_feat", AttributeAdder()),
    ("impute", SimpleImputer(strategy="median")),
])
transformer = ColumnTransformer(
    [
        ('num', num_pipe, num_att),
        ('cat', OneHotEncoder(), category_att),
    ],
    remainder='passthrough',
)
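For a quick check of the pipeline approach, here is a hedged sketch on made-up data. The class NamedAttributeAdder is a hypothetical variant of AttributeAdder that looks columns up by name rather than by the t_room-style positions, and the three-row DataFrame is invented; sparse_threshold=0 is set only to force a dense array for inspection.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


class NamedAttributeAdder(BaseEstimator, TransformerMixin):
    """Hypothetical adder: appends two ratio columns, selecting by column name."""

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        rooms = (X["total_rooms"] / X["households"]).to_numpy()
        pop = (X["population"] / X["households"]).to_numpy()
        return np.c_[X.to_numpy(), rooms, pop]


# Toy data (invented for illustration).
X = pd.DataFrame({
    "total_rooms": [100.0, 200.0, 300.0],
    "total_bedrooms": [20.0, np.nan, 60.0],
    "population": [50.0, 80.0, 90.0],
    "households": [10.0, 20.0, 30.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND"],
})
num_cols = ["total_rooms", "total_bedrooms", "population", "households"]
cat_cols = ["ocean_proximity"]

# Inside a Pipeline the steps DO run sequentially: add features, then impute.
num_pipe = Pipeline([
    ("add_feat", NamedAttributeAdder()),
    ("impute", SimpleImputer(strategy="median")),
])
transformer = ColumnTransformer(
    [("num", num_pipe, num_cols), ("cat", OneHotEncoder(), cat_cols)],
    sparse_threshold=0,  # always return a dense array
)

out = transformer.fit_transform(X)
print(out.shape)   # 4 numeric + 2 ratios + 2 one-hot columns
print(out[:, 1])   # total_bedrooms, imputed in place, no duplicate column
```

Note that the imputer here also covers the two ratio columns, so a missing value in any of their inputs would be filled with the ratio's median as well.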
The second: since you don't actually need your total_bedrooms column for your AttributeAdder, you don't need to pass it into that transformer. The specifics of this will depend on how you're using t_room, t_household, etc., but generally:
transformer = ColumnTransformer(
    [
        # pass a flat list of column names; total_bedrooms is deliberately left out
        ('adder', AttributeAdder(), ["total_rooms", "households", "population"]),
        ('imputer', SimpleImputer(strategy='median'), ['total_bedrooms']),
        ('ohe', OneHotEncoder(), category_att)
    ],
    remainder='passthrough'  # now you're relying on this one much more
)
In a related approach, you have more flexibility in how the added features are computed. Change your AttributeAdder to return just the new features (don't concatenate to X in the last step of transform), and rely on the ColumnTransformer to pass those features along. (Note that we can't rely on remainder for those, but we can use "passthrough" as one of the transformers.)
class AttributeAdder(BaseEstimator, TransformerMixin):
    ...
    def transform(self, X, y=None):
        ...
        return np.c_[room_per_household, population_per_household]
transformer = ColumnTransformer(
    [
        ('adder', AttributeAdder(), num_att),
        ('num', "passthrough", num_att.drop(['total_bedrooms'])),
        ('imputer', SimpleImputer(strategy='median'), ['total_bedrooms']),
        ('ohe', OneHotEncoder(), category_att)
    ],
    remainder='passthrough',  # if you have columns in neither of num_att and category_att that you want kept
)
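Put together on toy data, this last variant might look like the sketch below. RatioAdder is a hypothetical name for the modified adder, it selects columns by name (an assumption, since the original uses positional indices not shown in the question), and the DataFrame is invented. sparse_threshold=0 is used only so the result is a dense array that is easy to inspect.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder


class RatioAdder(BaseEstimator, TransformerMixin):
    """Hypothetical adder that returns ONLY the two new ratio columns."""

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        rooms = (X["total_rooms"] / X["households"]).to_numpy()
        pop = (X["population"] / X["households"]).to_numpy()
        return np.c_[rooms, pop]


# Toy data (invented for illustration).
X = pd.DataFrame({
    "total_rooms": [100.0, 200.0, 300.0],
    "total_bedrooms": [20.0, np.nan, 60.0],
    "population": [50.0, 80.0, 90.0],
    "households": [10.0, 20.0, 30.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND"],
})
num_cols = ["total_rooms", "total_bedrooms", "population", "households"]
cat_cols = ["ocean_proximity"]
# Every numeric column except the one that needs imputing.
plain_num_cols = [c for c in num_cols if c != "total_bedrooms"]

transformer = ColumnTransformer(
    [
        ("adder", RatioAdder(), num_cols),                  # -> 2 new columns
        ("num", "passthrough", plain_num_cols),             # -> 3 columns, unchanged
        ("imputer", SimpleImputer(strategy="median"), ["total_bedrooms"]),  # -> 1 column
        ("ohe", OneHotEncoder(), cat_cols),                 # -> 2 one-hot columns
    ],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)

out = transformer.fit_transform(X)
print(out.shape)   # each input column appears once, plus the two ratios
print(out[:, 5])   # the single, imputed total_bedrooms column
```

Because the adder no longer echoes its input, total_bedrooms reaches the output only through the imputer, so there is exactly one copy of it.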
Upvotes: 5