Given an array of text data,
X = np.array(['cat', 'dog', 'cow', 'cat', 'cow', 'dog'])
I would like to use an sklearn pipeline to produce output like
np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 0]])
My initial attempt
pipe = Pipeline([
    ('encoder', LabelEncoder()),
    ('hot', OneHotEncoder(sparse=False))])
print(pipe.fit_transform(X))
raises TypeError: fit_transform() takes exactly 2 arguments (3 given), as per this issue. I have tried editing the signature on LabelEncoder, so that SaneLabelEncoder().fit_transform(X) gives [0 2 1 0 1 2], but then
pipe = Pipeline([
    ('encoder', SaneLabelEncoder()),
    ('hot', OneHotEncoder(sparse=False))])
print(pipe.fit_transform(X))
gives [[ 1. 1. 1. 1. 1. 1.]]. Any suggestions on getting to the desired output?
Upvotes: 1
pandas has a function, get_dummies, for this:
pd.get_dummies(X)
This will produce a DataFrame:
   cat  cow  dog
0    1    0    0
1    0    0    1
2    0    1    0
3    1    0    0
4    0    1    0
5    0    0    1
Or if you must have an array of ints:
pd.get_dummies(X).values.astype(int)
This will yield:
[[1 0 0]
 [0 0 1]
 [0 1 0]
 [1 0 0]
 [0 1 0]
 [0 0 1]]
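Note that get_dummies orders its columns alphabetically (cat, cow, dog), whereas the output requested in the question has the columns in order of first appearance (cat, dog, cow). If that ordering matters, one possible approach is to reindex the columns before converting to an array. The following is only a sketch; the names `order`, `dummies` and `one_hot` are made up for illustration.

```python
import numpy as np
import pandas as pd

X = np.array(['cat', 'dog', 'cow', 'cat', 'cow', 'dog'])

# Labels in order of first appearance: ['cat', 'dog', 'cow'].
# dict.fromkeys keeps insertion order and drops duplicates.
order = list(dict.fromkeys(X.tolist()))

# get_dummies names each column after its label, so the columns
# can be selected in whatever order is needed.
dummies = pd.get_dummies(X)
one_hot = dummies[order].to_numpy().astype(int)

print(one_hot)
```

With the sample data this should reproduce the array from the question, with rows such as [0, 1, 0] for 'dog' and [0, 0, 1] for 'cow'.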
Upvotes: 1
Use LabelBinarizer:
import numpy as np
from sklearn import preprocessing
X = np.array(['cat', 'dog', 'cow', 'cat', 'cow', 'dog'])
binar = preprocessing.LabelBinarizer()
X_bin = binar.fit_transform(X)
print(X_bin)
The output is:
[[1 0 0]
 [0 0 1]
 [0 1 0]
 [1 0 0]
 [0 1 0]
 [0 0 1]]
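Since the question asks for a Pipeline specifically: as far as I know, LabelBinarizer has the same fit_transform(y) signature as LabelEncoder, so placing it directly in a Pipeline is likely to raise the same TypeError. A thin wrapper with the usual fit(X, y=None) / transform(X) signature is one way around that. This is a minimal sketch, and the class name PipelineLabelBinarizer is hypothetical, not part of scikit-learn.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer


class PipelineLabelBinarizer(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper giving LabelBinarizer the (X, y) signature Pipeline calls."""

    def fit(self, X, y=None):
        # y is accepted only so the Pipeline call signature matches; it is ignored.
        self.binarizer_ = LabelBinarizer().fit(X)
        return self

    def transform(self, X):
        return self.binarizer_.transform(X)


X = np.array(['cat', 'dog', 'cow', 'cat', 'cow', 'dog'])

pipe = Pipeline([('binarize', PipelineLabelBinarizer())])
X_bin = pipe.fit_transform(X)
print(X_bin)
```

For the sample data this should give the same array as above. One caveat: with exactly two distinct labels, LabelBinarizer returns a single 0/1 column rather than two columns.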
Upvotes: 3