Reputation: 85
I am trying to use sklearn.preprocessing.LabelBinarizer()
to create a one-hot encoding for labels with only two classes, i.e. I only want to categorize two sets of objects. In this case, when I use fit(range(0, 2)), each transformed label is just a single value instead of a 2x1 vector. This is fine on its own, but when I want to use the labels in TensorFlow, the shape should really be (2, 1) for dimensional consistency. Please advise how I can resolve this.
Here is the code:
from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()
lb.fit(range(0, 3))
Calling lb.transform([1, 0]), the result is:
[[0 1 0]
[1 0 0]]
whereas when we change 3 to 2, i.e. lb.fit(range(0, 2)), the result would be
[[1]
[0]]
instead of
[[0 1]
[1 0]]
This creates problems for algorithms that expect labels as a consistent n-dimensional array. Is there any way to resolve this issue?
Upvotes: 3
Views: 2529
Reputation: 3635
As already noted in a comment, this is not an issue with the method. According to the documentation: Binary targets transform to a column vector. You can build the array you want from the column-vector result when there are only two classes.
A direct and simple way to do this is:
import numpy as np
from sklearn import preprocessing

lb = preprocessing.LabelBinarizer()
lb.fit(range(2))  # range(0, 2) is the same as range(2)
a = lb.transform([1, 0])  # column vector: [[1], [0]]
# rebuild a two-column one-hot row from each single-column entry
result_2d = np.array([[0 if item[0] else 1, item[0]] for item in a])
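If you prefer to avoid the Python-level loop, an equivalent vectorized sketch (plain NumPy applied to the same column vector a as above) is to stack the complement column next to it:
import numpy as np
from sklearn import preprocessing

lb = preprocessing.LabelBinarizer()
lb.fit(range(2))
a = lb.transform([1, 0])           # [[1], [0]]

# column 0 is "is class 0" (1 - a), column 1 is "is class 1" (a)
result_2d = np.hstack((1 - a, a))
print(result_2d)                   # [[0 1]
                                   #  [1 0]]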
Upvotes: 1
Reputation: 16966
LabelBinarizer()'s purpose, according to the documentation, is to
Binarize labels in a one-vs-all fashion
Several regression and binary classification algorithms are available in scikit-learn. A simple way to extend these algorithms to the multi-class classification case is to use the so-called one-vs-all scheme.
If your data has only two types of labels, then you can feed them directly to a binary classifier. Hence, one column is enough to capture the two classes in a one-vs-rest fashion.
Binary targets transform to a column vector
>>> lb = preprocessing.LabelBinarizer()
>>> lb.fit_transform(['yes', 'no', 'no', 'yes'])
array([[1],
[0],
[0],
[1]])
If your intention is just to create a one-hot encoding, use the following approach.
>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit_transform([['yes'], ['no'], ['no'], ['yes']]).toarray()
array([[0., 1.],
[1., 0.],
[1., 0.],
[0., 1.]])
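Applied to the asker's integer labels, the same approach gives the desired two-column output. A minimal sketch (OneHotEncoder expects a 2-D array of shape (n_samples, n_features), hence the reshape):
>>> import numpy as np
>>> from sklearn.preprocessing import OneHotEncoder
>>> labels = np.array([1, 0]).reshape(-1, 1)  # one column, one row per sample
>>> enc = OneHotEncoder()
>>> enc.fit_transform(labels).toarray()
array([[0., 1.],
       [1., 0.]])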
Hope this clarifies your question of why sklearn's LabelBinarizer()
does not convert two-class data into a two-column output.
Upvotes: 3