Reputation: 10578
My pipeline looks like this:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelBinarizer
train_animals = pd.DataFrame({'animal': ['cat', 'dog', 'dog']})
lb = LabelBinarizer()
lb.fit_transform(train_animals.animal)
Which generates:
array([[0],
[1],
[1]])
However, when I apply my pipeline on unseen data:
test_animals = pd.DataFrame({'animal': ['cat', 'cat', 'duck', 'fish']})
lb.transform(test_animals.animal)
It will spit out:
array([[1, 0],
[1, 0],
[0, 0],
[0, 0]])
Which breaks everything.
I need LabelBinarizer to ALWAYS one-hot encode and never generate a single column. So:
lb = LabelBinarizer()
lb.fit_transform(train_animals.animal)
Will ideally generate:
array([[1, 0],
[0, 1],
[0, 1]])
Upvotes: 0
Views: 388
Reputation: 10578
I think I've come up with a solution that hacks the label_binarize function and works with DataFrameMapper:
import pandas as pd
import numpy as np
from sklearn.preprocessing import label_binarize, LabelBinarizer
from sklearn.base import TransformerMixin
from sklearn_pandas import DataFrameMapper
class SafeLabelBinarizer(TransformerMixin):

    def __init__(self):
        self.lb = LabelBinarizer()

    def fit(self, X, y=None):
        X = np.array(X)
        self.lb.fit(X)
        self.classes_ = self.lb.classes_
        return self

    def transform(self, X):
        # Append a dummy class so label_binarize sees more than two
        # classes and always emits one column per class.
        K = np.append(self.classes_, ['__FAKE__'])
        X = label_binarize(X, classes=K, pos_label=1, neg_label=0)
        # Drop the dummy class column again.
        X = np.delete(X, np.s_[-1], axis=1)
        return X

    def fit_transform(self, X, y=None):
        self.fit(X)
        return self.transform(X)
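The trick works because label_binarize emits one column per class whenever there are more than two classes; appending the dummy '__FAKE__' class forces that path even for binary data. A quick standalone check of that assumption:

```python
import numpy as np
from sklearn.preprocessing import label_binarize

# Two real classes plus the dummy one.
classes = np.array(['cat', 'dog', '__FAKE__'])
Y = label_binarize(['cat', 'dog', 'dog'], classes=classes)
print(Y)  # three columns, in the order cat, dog, __FAKE__

# Drop the dummy column to get one real column per class.
Y = np.delete(Y, np.s_[-1], axis=1)
print(Y)
```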
Training data:
train_animals = pd.DataFrame({'animal': ['cat', 'dog', 'dog']})
mapper = DataFrameMapper([
('animal', SafeLabelBinarizer())], df_out=True)
mapper.fit_transform(train_animals)
>>>
animal_cat animal_dog
0 1 0
1 0 1
2 0 1
Unseen data:
test_animals = pd.DataFrame({'animal': ['cat', 'cat', 'duck', 'fish']})
mapper.transform(test_animals)
>>>
animal_cat animal_dog
0 1 0
1 1 0
2 0 0
3 0 0
🎉
Upvotes: 1
Reputation: 36619
It's documented here that for binary data the output will contain only one column:
Returns: Y : array or CSR matrix of shape [n_samples, n_classes]. Shape will be [n_samples, 1] for binary problems.
If you need one column per category, you can try the following methods:
1) pd.get_dummies():
train_animals = pd.DataFrame({'animal': ['cat', 'dog', 'dog']})
pd.get_dummies(train_animals).values
array([[1, 0],
[0, 1],
[0, 1]])
But the caveat of this approach is that you need to apply get_dummies() to the full data before splitting into train and test, not just to the train data, because the test data may generate a different number of columns.
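One way around that caveat (a sketch, using the same toy train/test frames) is to record the dummy columns produced at fit time and reindex the test dummies against them, so unseen categories are dropped and absent ones are filled with zeros:

```python
import pandas as pd

train_animals = pd.DataFrame({'animal': ['cat', 'dog', 'dog']})
test_animals = pd.DataFrame({'animal': ['cat', 'cat', 'duck', 'fish']})

# Fit time: record which dummy columns the training data produced.
train_dummies = pd.get_dummies(train_animals, dtype=int)
train_columns = train_dummies.columns

# Transform time: align the test dummies to the training columns,
# dropping unseen categories and filling missing ones with 0.
test_dummies = pd.get_dummies(test_animals, dtype=int).reindex(
    columns=train_columns, fill_value=0)
print(test_dummies.values)
```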
2) CategoricalEncoder:
from sklearn.preprocessing import CategoricalEncoder
enc = CategoricalEncoder()
train_animals = pd.DataFrame({'animal': ['cat', 'dog', 'dog']})
enc.fit_transform(train_animals[['animal']])
array([[1, 0],
[0, 1],
[0, 1]])
Note that CategoricalEncoder is still only on the development branch, so it may not be easy to use.
3) Instead of CategoricalEncoder, you can use the combination of LabelEncoder and OneHotEncoder. See my other answer for more details on usage.
But for points 2 and 3, you need to make sure that all the possible values in the 'animal' column are present in the training data. If the test set contains unseen values, these encoders will throw an error, because the ML model can't do anything with categories it hasn't seen.
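For reference, the LabelEncoder + OneHotEncoder combination looks roughly like this (a sketch, not the linked answer; the .toarray() call densifies the encoder's sparse output):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

train_animals = pd.DataFrame({'animal': ['cat', 'dog', 'dog']})

# Step 1: LabelEncoder maps each string category to an integer code.
le = LabelEncoder()
codes = le.fit_transform(train_animals['animal'])

# Step 2: OneHotEncoder expands the integer codes, one column per class.
ohe = OneHotEncoder()
onehot = ohe.fit_transform(codes.reshape(-1, 1)).toarray()
print(onehot)
```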
Upvotes: 0