Accessing column names of a pandas dataframe within a custom transformer in a Sklearn pipeline with ColumnTransformer?

I need to use a custom transformer within a pipeline that operates on the column names. However, the earlier pipeline steps convert the dataframe to a numpy array. I know I can retrieve the column names from the ColumnTransformer object after the pipeline has been fit, but I need to access the column names within the fit step. The custom transformer below is a minimal example for illustration only, not the true transformation.

import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.base import BaseEstimator, TransformerMixin


class MyCustomTransformer(BaseEstimator, TransformerMixin):
    def my_custom_transformation(self, X):
        """
        Parameters
        ----------
        X: pandas dataframe
        """
        columns_to_keep = [col for col in X.columns if col.endswith(('_a', '_b'))]
        return columns_to_keep
    
    def fit(self, X, y=None):
        self.columns_to_keep = self.my_custom_transformation(X)
        return self

    def transform(self, X, y=None):
        return X[self.columns_to_keep]

numeric_transformer = Pipeline(steps=[('minmax_scaler', MinMaxScaler())])
# Note: scikit-learn >= 1.2 renames this argument to sparse_output
categorical_transformer = Pipeline(steps=[('onehot_encoder', OneHotEncoder(sparse=False))])

column_transformer = ColumnTransformer(transformers=[
    ('numeric_transformer', numeric_transformer, ['num']),
    ('categorical_transformer', categorical_transformer, ['cat']),
])

pipeline = Pipeline(steps=[
    ('column_transformer', column_transformer),
    ('my_custom_transformer', MyCustomTransformer())
])

df = pd.DataFrame(data={'num': [1,2,3], 'cat':['a', 'b', 'c']})
pipeline.fit(df)

which would ideally result in:

transformed_df = pipeline.transform(df)
print(transformed_df)
>>>    num    cat_a    cat_b
    0    0        1        0
    1  0.5        0        1
    2    1        0        0

The transformations in the column_transformer convert the dataframe to a numpy array, which is then passed to the custom transformer. Obviously this results in an error since you can't get the column names from a numpy array.

I can't use positional indexing to access the columns, since the one-hot encoding can produce a number of columns that is not known in advance.

If I could access the ColumnTransformer object within the fit method of the custom transformer, I could retrieve the column names, then create a pandas dataframe to use in the fit method as above (?), but I have not successfully found a way to do this.

Any help would be much appreciated.

Upvotes: 5

Answers (2)

thibaultbl

Reputation: 984

pip install sklearn-pandas-transformers

from sklearn_pandas_transformers.transformers import SklearnPandasWrapper

column_transformer = ColumnTransformer(transformers=[
    ('numeric_transformer', SklearnPandasWrapper(numeric_transformer), ['num']),
    ('categorical_transformer', SklearnPandasWrapper(categorical_transformer), ['cat']),
])

Upvotes: 0

Delphine

Reputation: 215

See my proposed implementation of a ColumnTransformerWithNames in my answer to "how do i get_feature_names using a column transformer". You can replace the calls to ColumnTransformer with ColumnTransformerWithNames, and the output of the pipeline will be a DataFrame with column names =)

Upvotes: 0
