Windstorm1981

Reputation: 2680

Python Pipeline Custom Transformer

I am trying to code a custom transformer to be used in a pipeline to pre-process data.

Here is the code I'm using (sourced - not written by me). It takes in a dataframe, scales the features, and returns a dataframe:

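import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

class DFStandardScaler(BaseEstimator, TransformerMixin):

    def __init__(self):
        self.ss = None

    def fit(self, X, y=None):
        # Fit the wrapped scaler on every column of X
        self.ss = StandardScaler().fit(X)
        return self

    def transform(self, X):
        # Scale, then restore the DataFrame's index and column labels
        Xss = self.ss.transform(X)
        Xscaled = pd.DataFrame(Xss, index=X.index, columns=X.columns)
        return Xscaled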

I have data with both categorical and continuous features. The scaler cannot handle the categorical feature ('sex'): when I fit this pipeline on the dataframe below, it throws an error because it tries to scale the categorical labels in 'sex':

     sex  length  diameter  height  whole_weight  shucked_weight  \
0      M   0.455     0.365   0.095        0.5140          0.2245   
1      M   0.350     0.265   0.090        0.2255          0.0995   
2      F   0.530     0.420   0.135        0.6770          0.2565   
3      M   0.440     0.365   0.125        0.5160          0.2155   
4      I   0.330     0.255   0.080        0.2050          0.0895   
5      I   0.425     0.300   0.095        0.3515          0.1410   

How do I pass a list of categorical / continuous features into the transformer so it will scale the proper features? Or is it better to somehow code the feature type check inside the transformer?

Upvotes: 0

Views: 224

Answers (1)

nickyfot

Reputation: 2019

Basically, you need another step in the Pipeline: a similar class, inheriting from BaseEstimator and TransformerMixin, that selects the columns to pass on.

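from sklearn.base import BaseEstimator, TransformerMixin

class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, columns: list):
        # Attribute name must match the parameter name so get_params() and clone() work
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn; the selection is stateless
        return self

    def transform(self, X, y=None):
        # Keep only the requested columns, as a DataFrame
        return X.loc[:, self.columns]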
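
As a quick sanity check, here is a minimal, self-contained sketch of such a selector applied to a toy frame. The three-row frame is made up for illustration from the first rows of the question's data, and the selector stores its argument as `columns` (the same name as the constructor parameter) because scikit-learn's `clone` and `get_params` rely on that convention.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone


class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        # Stored under the parameter's own name so get_params()/clone() work
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X.loc[:, self.columns]


# Toy frame for illustration only (values from the question's first rows)
df = pd.DataFrame({
    "sex": ["M", "M", "F"],
    "length": [0.455, 0.350, 0.530],
    "diameter": [0.365, 0.265, 0.420],
})

selector = ColumnSelector(["length", "diameter"])
numeric = selector.fit_transform(df)   # DataFrame without the 'sex' column
selector_copy = clone(selector)        # what grid search / cross-validation does internally
```

The result is still a DataFrame, so a DataFrame-aware scaler like the one in the question can follow it in the same pipeline.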

Then in your main the pipeline looks like this:

from sklearn import pipeline

selector = ColumnSelector(['length', 'diameter', 'height', 'whole_weight', 'shucked_weight'])
pipe = pipeline.make_pipeline(
    selector,
    DFStandardScaler()
)

pipe2 = pipeline.make_pipeline(
    # some steps for the sex column
)

full_pipeline = pipeline.make_pipeline(
    pipeline.make_union(
        pipe,
        pipe2
    ),
    # some other step
)
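
To make the outline above concrete, here is a runnable sketch under a few assumptions that are not in the original post: the "steps for the sex column" are taken to be a one-hot encoding, a plain `StandardScaler` stands in for `DFStandardScaler` (the union returns an array either way), and only three of the numeric columns are used to keep the example short.

```python
import pandas as pd
from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X.loc[:, self.columns]


# First six rows of the question's data, restricted to four columns
df = pd.DataFrame({
    "sex": ["M", "M", "F", "M", "I", "I"],
    "length": [0.455, 0.350, 0.530, 0.440, 0.330, 0.425],
    "diameter": [0.365, 0.265, 0.420, 0.365, 0.255, 0.300],
    "height": [0.095, 0.090, 0.135, 0.125, 0.080, 0.095],
})

numeric_cols = ["length", "diameter", "height"]

# Branch 1: select the continuous columns, then scale them
numeric_pipe = make_pipeline(ColumnSelector(numeric_cols), StandardScaler())

# Branch 2 (assumption): one-hot encode the categorical 'sex' column
sex_pipe = make_pipeline(ColumnSelector(["sex"]), OneHotEncoder())

# Run both branches on the same input and stack their outputs side by side
features = make_union(numeric_pipe, sex_pipe)
X = features.fit_transform(df)

# The union may return a sparse matrix when one branch is sparse
X_dense = X.toarray() if sparse.issparse(X) else X
```

Here `X_dense` holds the three scaled numeric columns followed by one indicator column per category of 'sex'. scikit-learn's `ColumnTransformer` covers the same use case without a custom selector, if your version provides it.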

Upvotes: 1
