Reputation: 29
This is my problem with sklearn's OneHotEncoder.
With an array a = [1,2,3,4,5,6,7,8,9,22], i.e. all values unique and a.shape = [10,1] (after reshape(-1,1)), a [10,10] matrix of one-hot encoded values is returned:
array([[ 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
[ 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.]])
But with an array like a = [1,2,2,4,4,6,7,8,9,22], i.e. some values repeated and a.shape = [10,1] (after reshape(-1,1)), a [10,8] matrix of one-hot encoded values is returned:
array([[ 1., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 1., 0., 0., 0., 0., 0., 0.],
[ 0., 1., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 1., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 1.]])
But I cannot use this, because my input placeholder expects a [10,10] matrix. Can anyone help me handle non-unique values in sklearn's OneHotEncoder?
P.S. Adding the parameter n_values=10 raises an error: ValueError: Feature out of bounds for n_values=10
Upvotes: 0
Views: 168
Reputation: 83
Do you know all the values your categorical feature can take? If so, you can do something like this:
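import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
enc.fit(np.asarray([1,2,3,4,5,6,7,8,9,22]).reshape(-1, 1))  # fit the encoder on every value the feature can take
data_for_encoding = np.asarray([1,2,2,4,4,6,7,8,9,22]).reshape(-1, 1)  # your data
sparse_matrix = enc.transform(data_for_encoding)  # encoded data, shape (10, 10), returned as a sparse matrix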
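If the placeholder needs a dense [10,10] array, a variant of the same idea is to pass the known values to the encoder directly. This is a minimal sketch, assuming scikit-learn 0.20 or later (where the categories parameter exists) and assuming known_values, a name made up here, really lists every value the feature can take:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Assumption: these are all the values the feature can ever take (sorted).
known_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 22]

# Giving `categories` explicitly fixes the number and order of output columns,
# so repeated or missing values in the data no longer change the width.
enc = OneHotEncoder(categories=[known_values])

data = np.asarray([1, 2, 2, 4, 4, 6, 7, 8, 9, 22]).reshape(-1, 1)

# fit_transform returns a sparse matrix by default; toarray() makes it dense.
encoded = enc.fit_transform(data).toarray()
print(encoded.shape)
```

The columns for 3 and 5 are simply all zeros here, since those values never occur in the data. As for the P.S., the error looks consistent with n_values=10 meaning "values lie in range(10)", which 22 does not.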
Upvotes: 1