Reputation: 1415
Let's say I have a vector of integers, where every integer corresponds to a category:
A = [1, 2, 2, 3, 3, 1, 2, 4, 4, 1]
I know how many categories I have. This vector is one of the columns of my X dataset, which will be fed into a logistic regression model.
Is it possible to use the scikit-learn class OneHotEncoder to obtain something like:
0 0 0 1 (when 1)
0 0 1 0 (when 2)
0 1 0 0 (when 3)
1 0 0 0 (when whatever)
or even better
0 0 0
0 0 1
0 1 0
1 0 0
?
When I try to pass such a vector to OneHotEncoder, I get this error: need more than 1 value to unpack.
Furthermore, I suppose that if I have 'NULL' records I should first transform them into a number. Is there a fast way to do it, like A(find(A=='NULL'))=123?
Thank you for your help. Francesco
Upvotes: 2
Views: 2300
Reputation: 363597
OneHotEncoder input needs to be 2-d, not 1-d (it expects a set of samples).
>>> import numpy as np
>>> from sklearn.preprocessing import OneHotEncoder
>>> X = [[1, 2, 2, 3, 3, 1, 2, 4, 4, 1]]
Let's suppose that your categorical features can all take on four values:
>>> n_values = np.repeat(4, len(X[0]))
>>> n_values
array([4, 4, 4, 4, 4, 4, 4, 4, 4, 4])
Then OneHotEncoder works fine:
>>> oh = OneHotEncoder(n_values=n_values)
>>> Xt = oh.fit_transform(X)
>>> Xt.toarray()
array([[ 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0.,
0., 0., 1., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0.,
1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 1., 0.,
0.]])
>>> Xt.shape
(1, 40)
It produces one dummy variable too many for each input variable (four columns where three would suffice), which is a bit wasteful. I have no idea what you mean by the NULL stuff, since I don't know what your data looks like. You might want to open a separate question for that.
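For the layout described in the question (one column of the dataset, ten samples), a minimal sketch might look like the following. It assumes a recent scikit-learn release, where the n_values parameter is no longer available and the `categories` and `drop` parameters are used instead. The names `X_col`, `full`, and `reduced` are illustrative only, and the category list `[1, 2, 3, 4]` is an assumption taken from the example vector.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# The vector from the question: ten samples of a single categorical feature.
A = np.array([1, 2, 2, 3, 3, 1, 2, 4, 4, 1])

# OneHotEncoder expects 2-d input of shape (n_samples, n_features),
# so turn the 1-d vector into a single column.
X_col = A.reshape(-1, 1)

# Assumption: the four categories are known up front and are exactly 1..4.
categories = [[1, 2, 3, 4]]

# One indicator column per category (four columns).
full = OneHotEncoder(categories=categories).fit_transform(X_col).toarray()

# Dropping the first category gives three columns; category 1 is then
# encoded as an all-zero row, similar to the "even better" layout asked for.
reduced = (
    OneHotEncoder(categories=categories, drop="first")
    .fit_transform(X_col)
    .toarray()
)

print(full.shape)
print(reduced.shape)
```

The column order here follows the order of `categories`, so it is mirrored relative to the table in the question; the information content is the same. Dropping one column is commonly done for unregularized linear models to avoid redundant indicator columns.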
Upvotes: 3