emehex

Reputation: 10578

Inconsistent LabelBinarizer Behaviour breaks Pipeline

My pipeline looks like this:

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelBinarizer

train_animals = pd.DataFrame({'animal': ['cat', 'dog', 'dog']})

lb = LabelBinarizer()
lb.fit_transform(train_animals.animal)

Which generates:

array([[0],
       [1],
       [1]])

However, when I apply my pipeline on unseen data:

test_animals = pd.DataFrame({'animal': ['cat', 'cat', 'duck', 'fish']})
lb.transform(test_animals.animal)

It will spit out:

array([[1, 0],
       [1, 0],
       [0, 0],
       [0, 0]])

Which breaks everything.

I need LabelBinarizer to ALWAYS onehotencode and never generate a single column. So:

lb = LabelBinarizer()
lb.fit_transform(train_animals.animal)

Will ideally generate:

array([[1, 0],
       [0, 1],
       [0, 1]])

Upvotes: 0

Views: 388

Answers (2)

emehex

Reputation: 10578

I think I've come up with a solution that hacks the internal label_binarize function and works with DataFrameMapper:

import pandas as pd
import numpy as np
from sklearn.preprocessing import label_binarize, LabelBinarizer
from sklearn.base import TransformerMixin
from sklearn_pandas import DataFrameMapper

class SafeLabelBinarizer(TransformerMixin):
    """LabelBinarizer that always one-hot encodes (even for binary problems)
    and maps unseen labels to all-zero rows."""

    def __init__(self):
        self.lb = LabelBinarizer()

    def fit(self, X, y=None):
        X = np.array(X)
        self.lb.fit(X)
        self.classes_ = self.lb.classes_
        return self

    def transform(self, X):
        # Append a fake class so label_binarize always sees >= 3 classes and
        # therefore always produces one column per class, then drop the fake
        # column again.
        K = np.append(self.classes_, ['__FAKE__'])
        X = label_binarize(X, classes=K, pos_label=1, neg_label=0)
        X = np.delete(X, np.s_[-1], axis=1)
        return X

    def fit_transform(self, X, y=None):
        return self.fit(X).transform(X)

Training data:

train_animals = pd.DataFrame({'animal': ['cat', 'dog', 'dog']})

mapper = DataFrameMapper([
    ('animal', SafeLabelBinarizer())], df_out=True)

mapper.fit_transform(train_animals)

>>>

   animal_cat  animal_dog
0           1           0
1           0           1
2           0           1

Unseen data:

test_animals = pd.DataFrame({'animal': ['cat', 'cat', 'duck', 'fish']})
mapper.transform(test_animals)

>>>

   animal_cat  animal_dog
0           1           0
1           1           0
2           0           0
3           0           0

🎉

Upvotes: 1

Vivek Kumar

Reputation: 36619

It's documented here that binary data will only produce a single column:

Returns: Y : array or CSR matrix of shape [n_samples, n_classes]. Shape will be [n_samples, 1] for binary problems.

If you need one-column per category, you can try the following methods:

1) pd.get_dummies()

train_animals = pd.DataFrame({'animal': ['cat', 'dog', 'dog']})
pd.get_dummies(train_animals).values

array([[1, 0],
       [0, 1],
       [0, 1]])

The caveat of this approach is that you need to transform the whole dataset before splitting into train and test, not just the train data, because otherwise the test data can generate a different number of columns.
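One way to work around that caveat, sketched below, is to encode train and test separately and then reindex the test columns against the training columns: unseen test categories are dropped and missing training categories are filled with 0.

```python
import pandas as pd

train_animals = pd.DataFrame({'animal': ['cat', 'dog', 'dog']})
test_animals = pd.DataFrame({'animal': ['cat', 'cat', 'duck', 'fish']})

train_dummies = pd.get_dummies(train_animals)
# Align the test columns to the training columns: 'animal_duck' and
# 'animal_fish' are dropped, and any missing training column is filled with 0.
test_dummies = pd.get_dummies(test_animals).reindex(
    columns=train_dummies.columns, fill_value=0)
```

Here `test_dummies` ends up with exactly the training columns (`animal_cat`, `animal_dog`), with all-zero rows for the unseen 'duck' and 'fish' values.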

2) CategoricalEncoder()

from sklearn.preprocessing import CategoricalEncoder
enc = CategoricalEncoder()
train_animals = pd.DataFrame({'animal': ['cat', 'dog', 'dog']})
enc.fit_transform(train_animals[['animal']])

array([[1, 0],
       [0, 1],
       [0, 1]])

Note that CategoricalEncoder only exists in the development branch, so it may not be easy to use.

3) Instead of CategoricalEncoder, you can use the combination of LabelEncoder and OneHotEncoder. See my other answer for more details on usage.
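A minimal sketch of the two-step combination: LabelEncoder maps the strings to integer codes, and OneHotEncoder expands those codes into one column per class.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Step 1: strings -> integer codes (classes are sorted, so cat=0, dog=1)
le = LabelEncoder()
codes = le.fit_transform(['cat', 'dog', 'dog'])

# Step 2: integer codes -> one column per class (OneHotEncoder needs 2-D input)
ohe = OneHotEncoder()
onehot = ohe.fit_transform(codes.reshape(-1, 1)).toarray()
# onehot is [[1, 0], [0, 1], [0, 1]]
```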

But for points 2 and 3, you need to make sure that all the possible values of the 'animal' column are present in the training data. If the test set contains unseen values, it will throw an error, because the ML model can't do anything with categories it hasn't seen.
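(As a side note: in scikit-learn 0.20 and later, OneHotEncoder itself accepts string columns and grew a `handle_unknown='ignore'` option, which turns unseen test categories into all-zero rows instead of raising an error; that is exactly the behaviour the question asks for. A sketch, assuming scikit-learn >= 0.20:)

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(pd.DataFrame({'animal': ['cat', 'dog', 'dog']}))

test_animals = pd.DataFrame({'animal': ['cat', 'cat', 'duck', 'fish']})
# The unseen categories 'duck' and 'fish' become all-zero rows.
result = enc.transform(test_animals).toarray()
```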

Upvotes: 0
