Mike Palei

Reputation: 81

sklearn ColumnTransformer with MultilabelBinarizer

I wonder if it is possible to use a MultilabelBinarizer within a ColumnTransformer.

I have a toy pandas dataframe like:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({"id": [1, 2, 3],
                   "text": ["some text", "some other text", "yet another text"],
                   "label": [["white", "cat"], ["black", "cat"], ["brown", "dog"]]})

preprocess = ColumnTransformer(
    [
        ('vectorizer', CountVectorizer(), 'text'),
        ('binarizer', MultiLabelBinarizer(), ['label']),
    ],
    remainder='drop')

This code, however, throws an exception:

~/lib/python3.7/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    714     with _print_elapsed_time(message_clsname, message):
    715         if hasattr(transformer, 'fit_transform'):
--> 716             res = transformer.fit_transform(X, y, **fit_params)
    717         else:
    718             res = transformer.fit(X, y, **fit_params).transform(X)

TypeError: fit_transform() takes 2 positional arguments but 3 were given

With OneHotEncoder the ColumnTransformer does work.
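
For comparison, a minimal sketch of a variant that does run, assuming the OneHotEncoder is applied to a hashable column such as id rather than the list-valued label column:

from sklearn.preprocessing import OneHotEncoder

preprocess_ohe = ColumnTransformer(
    [
        ('vectorizer', CountVectorizer(), 'text'),
        ('one_hot', OneHotEncoder(), ['id']),  # works: 'id' holds scalar values
    ],
    remainder='drop')

preprocess_ohe.fit_transform(df)  # no TypeError here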

Upvotes: 8

Views: 2540

Answers (3)

Adib

Reputation: 1334

I made modifications to @ji.xu's answer by including two important amendments:

  • Ability to pass the full dataframe to record all categories

  • Ability to get all the feature names via estimator.get_feature_names_out()

In your ColumnTransformer, initialize the transformer like this:

        # Transformer entry for multi-hot encoding list-valued columns
        (
            "array_one_hot_encode",
            MultiHotEncoder(df=df),
            ["Col_A", "Col_B"]
        )

Hope this answer helps others who want to build custom transformers!

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MultiLabelBinarizer

class MultiHotEncoder(BaseEstimator, TransformerMixin):
    """Wraps `MultiLabelBinarizer` in a form that can work with `ColumnTransformer`. Note
    that input X has to be a `pandas.DataFrame`.

    Requires the non-training DataFrame to ensure it collects all labels so it won't be lost in train-test-split

    To initialize, you musth pass the full DataFrame and not 
    the df_train or df_test to guarantee that you captured all categories.
    Otherwise, you'll receive a user error with regards to missing/unknown categories.
    """
    def __init__(self, df:pd.DataFrame):
        self.mlbs = list()
        self.n_columns = 0
        self.categories_ = self.classes_ = list()
        self.df = df
    
    def fit(self, X:pd.DataFrame, y=None):
        
        # Collect columns
        self.columns = X.columns.to_list()

        # Loop through columns
        for i in range(X.shape[1]): # X can be of multiple columns
            mlb = MultiLabelBinarizer()
            mlb.fit(self.df[self.columns].iloc[:,i])
            self.mlbs.append(mlb)
            self.classes_.append(mlb.classes_)
            self.n_columns += 1

        self.categories_ = self.classes_

        return self

    def transform(self, X:pd.DataFrame):
        if self.n_columns == 0:
            raise ValueError('Please fit the transformer first.')
        if self.n_columns != X.shape[1]:
            raise ValueError(f'The fit transformer deals with {self.n_columns} columns '
                             f'while the input has {X.shape[1]}.'
                            )
        result = list()
        for i in range(self.n_columns):
            result.append(self.mlbs[i].transform(X.iloc[:,i]))

        result = np.concatenate(result, axis=1)
        return result

    def fit_transform(self, X:pd.DataFrame, y=None):
        return self.fit(X).transform(X)

    def get_feature_names_out(self, input_features=None):
        cats = self.categories_
        if input_features is None:
            input_features = self.columns
        elif len(input_features) != len(self.categories_):
            raise ValueError(
                "input_features should have length equal to number of "
                "features ({}), got {}".format(len(self.categories_),
                                               len(input_features)))

        feature_names = []
        for i in range(len(cats)):
            names = [input_features[i] + "_" + str(t) for t in cats[i]]
            feature_names.extend(names)

        return np.asarray(feature_names, dtype=object)
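
A minimal end-to-end sketch, assuming the toy dataframe from the question (the Col_A/Col_B names above are placeholders; here only the single label column is multi-hot encoded):

preprocess = ColumnTransformer(
    [
        ('vectorizer', CountVectorizer(), 'text'),
        ('array_one_hot_encode', MultiHotEncoder(df=df), ['label']),
    ],
    remainder='drop')

encoded = preprocess.fit_transform(df)

# Feature names of the multi-hot block, e.g. array(['label_black', 'label_brown', ...])
preprocess.named_transformers_['array_one_hot_encode'].get_feature_names_out()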

Upvotes: 0

ji.xu

Reputation: 475

For input X, MultiLabelBinarizer can only deal with one column at a time (each row of that column is expected to be a sequence of categories), while OneHotEncoder can deal with multiple columns at once. To make a ColumnTransformer-compatible MultiHotEncoder, you need to iterate through all columns of X and fit/transform each column with its own MultiLabelBinarizer. The following works with pandas.DataFrame input.

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MultiLabelBinarizer, OneHotEncoder

class MultiHotEncoder(BaseEstimator, TransformerMixin):
    """Wraps `MultiLabelBinarizer` in a form that can work with `ColumnTransformer`. Note
    that input X has to be a `pandas.DataFrame`.
    """
    def __init__(self):
        self.mlbs = list()
        self.n_columns = 0
        self.categories_ = self.classes_ = list()

    def fit(self, X:pd.DataFrame, y=None):
        for i in range(X.shape[1]): # X can be of multiple columns
            mlb = MultiLabelBinarizer()
            mlb.fit(X.iloc[:,i])
            self.mlbs.append(mlb)
            self.classes_.append(mlb.classes_)
            self.n_columns += 1
        return self

    def transform(self, X:pd.DataFrame):
        if self.n_columns == 0:
            raise ValueError('Please fit the transformer first.')
        if self.n_columns != X.shape[1]:
            raise ValueError(f'The fit transformer deals with {self.n_columns} columns '
                             f'while the input has {X.shape[1]}.'
                            )
        result = list()
        for i in range(self.n_columns):
            result.append(self.mlbs[i].transform(X.iloc[:,i]))

        result = np.concatenate(result, axis=1)
        return result

# test
temp = pd.DataFrame({
    "id":[1,2,3], 
    "text": ["some text", "some other text", "yet another text"], 
    "label": [["white", "cat"], ["black", "cat"], ["brown", "dog"]],
    "label2": [["w", "c"], ["b", "c"], ["b", "d"]]
})

col_transformer = ColumnTransformer([
    ('one-hot', OneHotEncoder(), ['id','text']),
    ('multi-hot', MultiHotEncoder(), ['label', 'label2'])
])
col_transformer.fit_transform(temp)

and you should get:

array([[1., 0., 0., 0., 1., 0., 0., 0., 1., 0., 1., 0., 1., 0., 1.],
       [0., 1., 0., 1., 0., 0., 1., 0., 1., 0., 0., 1., 1., 0., 0.],
       [0., 0., 1., 0., 0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 0.]])

Note how the first 3 and next 3 columns are one-hot encoded, while the following 5 and last 4 are multi-hot encoded. The categories info can be found as usual:

col_transformer.named_transformers_['one-hot'].categories_

>>> [array([1, 2, 3], dtype=object),
     array(['some other text', 'some text', 'yet another text'], dtype=object)]

col_transformer.named_transformers_['multi-hot'].categories_

>>> [array(['black', 'brown', 'cat', 'dog', 'white'], dtype=object),
     array(['b', 'c', 'd', 'w'], dtype=object)]

Upvotes: 5

mcapizzi

Reputation: 95

I wasn't diligent enough in my testing to know exactly why the code below works, but I was able to build a custom transformer that essentially "wraps" MultiLabelBinarizer while remaining compatible with ColumnTransformer:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MultiLabelBinarizer

class MultiLabelBinarizerFixedTransformer(BaseEstimator, TransformerMixin):
    """
    Wraps `MultiLabelBinarizer` in a form that can work with `ColumnTransformer`
    """
    def __init__(self):
        self.feature_name = ["mlb"]
        self.mlb = MultiLabelBinarizer(sparse_output=False)

    def fit(self, X, y=None):
        self.mlb.fit(X)
        return self

    def transform(self, X):
        return self.mlb.transform(X)

    def get_feature_names(self, input_features=None):
        cats = self.mlb.classes_
        if input_features is None:
            input_features = ['x%d' % i for i in range(len(cats))]
            print(input_features)
        elif len(input_features) != len(cats):
            raise ValueError(
                "input_features should have length equal to number of "
                "features ({}), got {}".format(len(cats),
                                               len(input_features)))

        feature_names = [f"{input_features[i]}_{cats[i]}" for i in range(len(cats))]
        return np.array(feature_names, dtype=object)
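
A minimal usage sketch, assuming the question's dataframe and imports; note the column is passed as a plain string ('label') so the wrapper receives a Series of label lists:

preprocess = ColumnTransformer(
    [
        ('vectorizer', CountVectorizer(), 'text'),
        ('binarizer', MultiLabelBinarizerFixedTransformer(), 'label'),
    ],
    remainder='drop')

preprocess.fit_transform(df)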

My hunch is that MultiLabelBinarizer expects a different set of arguments for fit_transform() than the ColumnTransformer passes it.
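
A quick way to confirm that hunch (a sketch using Python's inspect module): MultiLabelBinarizer treats its input as the target y, so its fit_transform accepts only one data argument, while ColumnTransformer calls fit_transform(X, y) with two.

import inspect
from sklearn.preprocessing import MultiLabelBinarizer

# MultiLabelBinarizer.fit_transform is defined as fit_transform(self, y), so the
# ColumnTransformer call transformer.fit_transform(X, y) supplies one positional
# argument too many -- hence the TypeError in the question.
print(inspect.signature(MultiLabelBinarizer.fit_transform))  # (self, y)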

Upvotes: 2
