Reputation: 2680
I am trying to code a custom transformer to be used in a pipeline to pre-process data.
Here is the code I'm using (taken from elsewhere, not written by me). It takes in a DataFrame, scales the features, and returns a DataFrame:
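    import pandas as pd
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.preprocessing import StandardScaler

    class DFStandardScaler(BaseEstimator, TransformerMixin):
        def __init__(self):
            self.ss = None

        def fit(self, X, y=None):
            # Learn the per-column mean and standard deviation.
            self.ss = StandardScaler().fit(X)
            return self

        def transform(self, X):
            # Scale, then rebuild a DataFrame with the original index and columns.
            Xss = self.ss.transform(X)
            Xscaled = pd.DataFrame(Xss, index=X.index, columns=X.columns)
            return Xscaled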
My data has both categorical and continuous features. The scaler obviously cannot handle the categorical feature ('sex'). When I fit the pipeline on the DataFrame below, it throws an error because it tries to scale the categorical labels in 'sex':
       sex  length  diameter  height  whole_weight  shucked_weight  \
    0    M   0.455     0.365   0.095        0.5140          0.2245
    1    M   0.350     0.265   0.090        0.2255          0.0995
    2    F   0.530     0.420   0.135        0.6770          0.2565
    3    M   0.440     0.365   0.125        0.5160          0.2155
    4    I   0.330     0.255   0.080        0.2050          0.0895
    5    I   0.425     0.300   0.095        0.3515          0.1410
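For reference, the failure can be reproduced with a minimal sketch like the one below. It assumes only the six rows shown above and three of the columns; the names `df`, `raised`, and `scaled` are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# First six rows of the frame shown above (three of the columns, for brevity).
df = pd.DataFrame({
    "sex": ["M", "M", "F", "M", "I", "I"],
    "length": [0.455, 0.350, 0.530, 0.440, 0.330, 0.425],
    "diameter": [0.365, 0.265, 0.420, 0.365, 0.255, 0.300],
})

try:
    # StandardScaler tries to convert every column to float, including 'sex'.
    StandardScaler().fit(df)
    raised = False
except ValueError as exc:
    raised = True
    print("Scaling failed:", exc)

# Restricting the frame to the numeric columns avoids the error.
scaled = StandardScaler().fit_transform(df[["length", "diameter"]])
print(scaled.shape)  # (6, 2)
```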
How do I pass a list of categorical / continuous features into the transformer so it will scale the proper features? Or is it better to somehow code the feature type check inside the transformer?
Upvotes: 0
Views: 224
Reputation: 2019
Basically, you need another step in the pipeline: a similar class that inherits from BaseEstimator and TransformerMixin and selects the columns to pass on:
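    class ColumnSelector(BaseEstimator, TransformerMixin):
        def __init__(self, columns: list):
            # Keep the attribute name equal to the argument name so that
            # get_params() and clone() can find it.
            self.columns = columns

        def fit(self, X, y=None):
            # Nothing to learn; column selection is stateless.
            return self

        def transform(self, X, y=None):
            return X.loc[:, self.columns]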
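As a quick check of the selector, here is a minimal sketch on a small frame built from the rows in the question. The names `df`, `selector`, and `numeric` are illustrative. The list is stored as `self.columns`, matching the constructor argument, because scikit-learn's `get_params()` and `clone()` look the attribute up by that name.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone


class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, columns: list):
        # Same name as the argument, so get_params() and clone() work.
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X.loc[:, self.columns]


# A few rows and columns from the question's frame.
df = pd.DataFrame({
    "sex": ["M", "M", "F"],
    "length": [0.455, 0.350, 0.530],
    "diameter": [0.365, 0.265, 0.420],
})

selector = ColumnSelector(["length", "diameter"])
numeric = selector.fit_transform(df)
print(list(numeric.columns))  # ['length', 'diameter']

# Cloning succeeds because the parameter can be read back by name.
copy = clone(selector)
print(copy.get_params())
```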
Then, in your main script, the pipeline looks like this:
    from sklearn import pipeline

    selector = ColumnSelector(['length', 'diameter', 'height', 'whole_weight', 'shucked_weight'])
    pipe = pipeline.make_pipeline(
        selector,
        DFStandardScaler()
    )
    pipe2 = pipeline.make_pipeline(
        # some steps for the sex column
    )
    full_pipeline = pipeline.make_pipeline(
        pipeline.make_union(
            pipe,
            pipe2
        ),
        # some other step
    )
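To see the pieces run together, here is a minimal end-to-end sketch. It makes assumptions that are not part of the answer above: the 'sex' branch uses `OneHotEncoder` purely as an illustration of "some steps for the sex column", only two numeric columns are used, and the helper `make_dense_one_hot` and the variable names are hypothetical. The selector stores its list as `self.columns` so that `get_params()` can read it back.

```python
import pandas as pd
from sklearn import pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class DFStandardScaler(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.ss = None

    def fit(self, X, y=None):
        self.ss = StandardScaler().fit(X)
        return self

    def transform(self, X):
        Xss = self.ss.transform(X)
        return pd.DataFrame(Xss, index=X.index, columns=X.columns)


class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X.loc[:, self.columns]


def make_dense_one_hot():
    # Assumption: one-hot encoding for 'sex'. The keyword for dense output
    # was renamed from `sparse` to `sparse_output` in scikit-learn 1.2.
    try:
        return OneHotEncoder(sparse_output=False)
    except TypeError:
        return OneHotEncoder(sparse=False)


df = pd.DataFrame({
    "sex": ["M", "M", "F", "M", "I", "I"],
    "length": [0.455, 0.350, 0.530, 0.440, 0.330, 0.425],
    "diameter": [0.365, 0.265, 0.420, 0.365, 0.255, 0.300],
})

# One branch per feature type, each starting with its own column selector.
numeric_pipe = pipeline.make_pipeline(
    ColumnSelector(["length", "diameter"]),
    DFStandardScaler(),
)
sex_pipe = pipeline.make_pipeline(
    ColumnSelector(["sex"]),
    make_dense_one_hot(),
)
full_pipeline = pipeline.make_pipeline(
    pipeline.make_union(numeric_pipe, sex_pipe),
)

features = full_pipeline.fit_transform(df)
print(features.shape)  # (6, 5): 2 scaled columns + 3 one-hot columns
```

Note that the union stacks the branch outputs side by side into a plain NumPy array, so the DataFrame index and column names from `DFStandardScaler` are not carried through to `features`.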
Upvotes: 1