Transforming text data in sklearn pipeline

Question

Given an array of text data,

X = np.array(['cat', 'dog', 'cow', 'cat', 'cow', 'dog'])

I would like to use an sklearn pipeline to produce output like

np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 0]])

My initial attempt

pipe = Pipeline([
    ('encoder', LabelEncoder()),
    ('hot', OneHotEncoder(sparse=False))])
print(pipe.fit_transform(X))

raises TypeError: fit_transform() takes exactly 2 arguments (3 given), as per this issue. I tried changing the signature of LabelEncoder so that SaneLabelEncoder().fit_transform(X) returns [0 2 1 0 1 2], but then

pipe = Pipeline([
    ('encoder', SaneLabelEncoder()),
    ('hot', OneHotEncoder(sparse=False))])
print(pipe.fit_transform(X))

gives [[ 1. 1. 1. 1. 1. 1.]]. Any suggestions on getting to the desired output?

Ryan · Accepted Answer

Use LabelBinarizer:

import numpy as np
from sklearn import preprocessing

X = np.array(['cat', 'dog', 'cow', 'cat', 'cow', 'dog'])
# LabelBinarizer maps each distinct label to its own 0/1 column
binar = preprocessing.LabelBinarizer()
X_bin = binar.fit_transform(X)
print(X_bin)

The output, with columns in sorted label order (cat, cow, dog; see binar.classes_), is:

[[1 0 0]
 [0 0 1]
 [0 1 0]
 [1 0 0]
 [0 1 0]
 [0 0 1]]
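If the encoding has to live inside a Pipeline, as the question asks, one possible approach is sketched below. It assumes a scikit-learn version in which OneHotEncoder accepts string categories directly (0.20 or later, as far as I know), and it reshapes the 1-D array into a single column first, since the encoder expects 2-D input. The helper name to_column is made up for this example and is not part of scikit-learn.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

X = np.array(['cat', 'dog', 'cow', 'cat', 'cow', 'dog'])


def to_column(a):
    # Hypothetical helper: turn a 1-D array of labels into an (n_samples, 1) column
    return np.asarray(a).reshape(-1, 1)


pipe = Pipeline([
    ('reshape', FunctionTransformer(to_column)),
    ('hot', OneHotEncoder()),  # default output is a sparse matrix
])

# Densify explicitly and cast to int for display
X_hot = pipe.fit_transform(X).toarray().astype(int)
print(X_hot)
```

This should give the same matrix as the LabelBinarizer output above, with columns in sorted label order. Calling .toarray() on the result, rather than passing sparse=False, sidesteps the fact that this parameter was renamed to sparse_output in later releases.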
