Emil

Reputation: 1722

Most efficient one-hot-encoder

Say I have a NumPy array target that looks as follows:

import numpy as np

target = np.array([1, 2, 3, 2, 3, 2, 3, 1, 1, 3])

I know the range of the values in target, namely 1 to 3.

Now I want to create a one-hot encoding of target whose length is the same as that of target.

To do so, I have used the following code:

target_one_hot = np.zeros([len(target), 4])

for i in range(0, len(target)):
    target_one_hot[i, target[i]] = 1

# drop the unused column 0 (the labels start at 1)
target_one_hot = np.delete(target_one_hot, 0, 1)

This works. However, I suspect that this operation can be written more efficiently without the for-loop. How can I do this?

Upvotes: 0

Views: 528

Answers (2)

Quang Hoang

Reputation: 150785

There's a OneHotEncoder for that:

from sklearn.preprocessing import OneHotEncoder

a = OneHotEncoder().fit_transform(target.reshape(-1,1))

The resulting one-hot matrix is a sparse matrix; you can get a NumPy array with:

a.toarray()

On the other hand, if you already know the range:

np.array(np.arange(1,4)[:,None]==target, dtype=int)
# 4.23 ms ± 66.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
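One thing worth noting about the broadcast comparison: as written, the classes run along the rows, so the result has shape (3, len(target)). A minimal sketch, assuming you want one row per element of target as in the question, swaps the operands (the names classes and one_hot are illustrative, not from the answer):

```python
import numpy as np

target = np.array([1, 2, 3, 2, 3, 2, 3, 1, 1, 3])
classes = np.arange(1, 4)  # assumed known label range 1..3

# Original orientation: classes along rows, shape (3, len(target))
by_class = np.array(classes[:, None] == target, dtype=int)

# Swapped operands: labels along rows, shape (len(target), 3)
one_hot = (target[:, None] == classes).astype(int)
```

The second form is simply the transpose of the first, so either can be used depending on the layout you need.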

Upvotes: 1

Divakar

Reputation: 221614

Approach #1

Create a boolean mask (for memory and performance efficiency), assign True at the column indices given by target (offset by one, since the labels start at 1), and finally use view to reinterpret it as an int array -

mask = np.zeros((len(target), 3), dtype=bool)
mask[np.arange(len(target)),target-1] = 1
out = mask.view('i1')

If the final output is required as floats, initialize mask with float dtype at the start and skip the final int conversion.
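A minimal sketch of that float variant, assuming float64 is the dtype you want (n_classes and out are illustrative names):

```python
import numpy as np

target = np.array([1, 2, 3, 2, 3, 2, 3, 1, 1, 3])
n_classes = 3  # assumed known label range 1..3

# Allocate the float output directly; no view/conversion step afterwards
out = np.zeros((len(target), n_classes), dtype=np.float64)

# Row i gets a 1.0 in column target[i] - 1 (labels start at 1)
out[np.arange(len(target)), target - 1] = 1.0
```

This matches the output dtype of the loop version in the question, since np.zeros defaults to float64.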

Approach #2

Another option is to use the offset target as a lookup index into the rows of an identity matrix -

np.eye(3, dtype=bool)[target-1].view('i1')

Approach #3

Look up rows directly with target, using an identity matrix shifted one diagonal down -

np.eye(4, k=-1, dtype=bool)[target,:-1].view('i1')
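To see why the shifted identity matrix works, it may help to look at it directly: np.eye(4, k=-1) puts the ones on the first subdiagonal, so row i has its 1 in column i - 1, and the last column is always zero and gets sliced off. The sketch below (variable names are illustrative) also checks that the three approaches agree on the sample target from the question:

```python
import numpy as np

target = np.array([1, 2, 3, 2, 3, 2, 3, 1, 1, 3])

# Row 0 is all zeros (unused); rows 1..3 hold the one-hot patterns
shifted_eye = np.eye(4, k=-1, dtype=bool)

# Approach #1: boolean mask filled by integer-array indexing
mask = np.zeros((len(target), 3), dtype=bool)
mask[np.arange(len(target)), target - 1] = True
a1 = mask.view('i1')

# Approach #2: rows of a 3x3 identity matrix, selected by target - 1
a2 = np.eye(3, dtype=bool)[target - 1].view('i1')

# Approach #3: rows of the shifted matrix, dropping the empty last column
a3 = shifted_eye[target, :-1].view('i1')
```

The view('i1') call reinterprets the one-byte booleans as int8 without copying, which works because both dtypes have the same item size.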

Timings on a large dataset -

In [46]: target = np.random.randint(1,4,1000000)

In [47]: %%timeit
    ...: mask = np.zeros((len(target), 3), dtype=bool)
    ...: mask[np.arange(len(target)),target-1] = 1
10.3 ms ± 48.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [48]: %timeit np.eye(3, dtype=bool)[target-1]
14.3 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [49]: %timeit np.eye(4, k=-1, dtype=bool)[target]
13.1 ms ± 80.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Upvotes: 2
