Shishir Narayan

Reputation: 29

Python sklearn OneHotEncoding categorical and sometimes repeated values

This is my problem with sklearn's OneHotEncoder. With an array a = [1,2,3,4,5,6,7,8,9,22], i.e. all values unique, of a.shape = [10,1] (after reshape(-1,1)), a [10,10] matrix of one-hot encoded values is returned.

array([[ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
   [ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.],
   [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.],
   [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
   [ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
   [ 0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.],
   [ 0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.],
   [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
   [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.],
   [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.]])

But with an array like a = [1,2,2,4,4,6,7,8,9,22], i.e. with repeated values, of a.shape = [10,1] (after reshape(-1,1)), a [10,8] matrix of one-hot encoded values is returned.

array([[ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
   [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
   [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
   [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.],
   [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.],
   [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.],
   [ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.],
   [ 0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.],
   [ 0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.],
   [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.]])

But I cannot use this, because my input placeholder expects a [10,10] matrix. Can anyone help me handle non-unique values in sklearn's OneHotEncoder?

P.S. Adding the parameter n_values=10 raises ValueError: Feature out of bounds for n_values=10

Upvotes: 0

Views: 168

Answers (1)

oleg

Reputation: 83

Do you know all the values your categorical feature can take? If so, you can do something like this:

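import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
enc.fit(np.asarray([1,2,3,4,5,6,7,8,9,22]).reshape(-1, 1))  # fit the encoder on every value the feature can take
data_for_encoding = np.asarray([1,2,2,4,4,6,7,8,9,22]).reshape(-1, 1)  # your data, with repeated values
sparse_matrix = enc.transform(data_for_encoding)  # encoded data: one column per value seen during fit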
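
If you are on scikit-learn 0.20 or newer (an assumption on my part, since the n_values parameter in the question suggests an older release), the same idea can probably be written by passing the known values through the categories parameter instead of fitting on a separate array. A minimal sketch, where all_values is just an illustrative name for the list taken from the question:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Every value the feature can take (copied from the question's first array).
# This list is an assumption: replace it with your real set of categories.
all_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 22]

# Listing the categories explicitly fixes the number of output columns,
# regardless of which values actually occur in the data.
enc = OneHotEncoder(categories=[all_values])

data = np.asarray([1, 2, 2, 4, 4, 6, 7, 8, 9, 22]).reshape(-1, 1)

# The encoder returns a sparse matrix by default; toarray() makes it dense.
dense = enc.fit_transform(data).toarray()

print(dense.shape)  # (10, 10)
```

The columns for 3 and 5 are simply all zeros here, because those values never occur in the data, so the result keeps the [10,10] shape that the placeholder expects.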

Upvotes: 1
