Laura

Reputation: 209

Encoding 32-bit hex numbers using OneHotEncoder in sklearn

I have some categorical features hashed into 32-bit hex numbers. For example, in one category, the three different classes are hashed into:

'05db9164'  '68fd1e64' '8cf07265'

One-hot encoding maps these into a binary array in which only one bit is 1 and the others are 0. So to encode the above features, only three bits are needed:

001 corresponds to 05db9164, 010 corresponds to 68fd1e64, and 100 corresponds to 8cf07265.

But when I use OneHotEncoder in sklearn, it tells me that the number is too large. This confuses me, because we don't care about the numerical value of the number; we only care whether two values are the same or not.

On the other hand, if I encode 0, 1, 2:

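from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
enc.fit([[0], [1], [2]])

# transform() returns a sparse matrix; toarray() makes it dense for printing
print(enc.transform([[0]]).toarray())
print(enc.transform([[1]]).toarray())
print(enc.transform([[2]]).toarray())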

I get the expected answer. I think these 32-bit hex numbers are only used to indicate the class within the category, so they play the same role as 0, 1, 2, and [0,0,1], [0,1,0], [1,0,0] should be enough to encode them. Could you please help me? Thanks very much.

Upvotes: 1

Views: 748

Answers (1)

eickenberg

Reputation: 14377

If your array is not extremely long, you can rename the features using np.unique. That way you can also determine the number of distinct feature values, which in turn you can feed to the OneHotEncoder so that it knows how many columns to allocate. Note that the renaming is not strictly necessary, but it has the nice side effect of generating integers, which use less space (if you use np.int32).

import numpy as np
rng = np.random.RandomState(42)
# generate some data
data = np.array(['05db9164', '68fd1e64', '8cf07265'])[rng.randint(0, 3, 100)]

uniques, new_labels = np.unique(data, return_inverse=True)
n_values = len(uniques)

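from sklearn.preprocessing import OneHotEncoder
# Older scikit-learn releases took OneHotEncoder(n_values=n_values); that
# parameter was later removed, so the integer codes are listed via `categories`.
encoder = OneHotEncoder(categories=[np.arange(n_values)])
encoded = encoder.fit_transform(new_labels[:, np.newaxis])

# sparse matrix with one row per sample and n_values columns
print(repr(encoded))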
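
As a side note, newer scikit-learn releases (0.20 and later, as far as I know) let OneHotEncoder work on string categories directly, so the renaming step can be skipped. The following is only a minimal sketch under that assumption; the short `hashes` array is made up for illustration and is not the data from the question.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Illustrative sample (an assumption): the three hashes, one of them repeated
hashes = np.array(['05db9164', '68fd1e64', '8cf07265', '68fd1e64'])

# OneHotEncoder expects a 2-D input: one row per sample, one column per feature
encoder = OneHotEncoder()
encoded = encoder.fit_transform(hashes.reshape(-1, 1)).toarray()

# The learned categories are stored in sorted order, one array per feature
print(encoder.categories_[0])
# Each row has exactly one 1, in the column of its category
print(encoded)
```

The columns follow the sorted order of the categories, so here the first column stands for '05db9164', which is the reverse of the bit order written in the question; either order carries the same information.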

Upvotes: 1
