D3181

Reputation: 2102

One hot encoding from numpy

I am trying to understand the values output by an example from a Python tutorial. The output doesn't seem to be in any order that I can understand. These particular lines are causing me trouble:

vocab_size = 13   #just to provide all variable values
m = 84 #just to provide all variable values
Y_one_hot = np.zeros((vocab_size, m))
Y_one_hot[Y.flatten(), np.arange(m)] = 1

The input Y.flatten() is evaluated as the following numpy-array :

  [ 8  9  7  4  9  7  8  4  8  7  8 12  4  8  9  8 12  7  8  9  7 12  7  2
  9  7  8  7  2  0  7  8 12  2  0  8  8 12  7  0  8  6 12  7  2  8  6  5
  7  2  0  6  5 10  2  0  8  5 10  1  0  8  6 10  1  3  8  6  5  1  3 11
  6  5 10  3 11  5 10  1 11 10  1  3]

np.arange(m) is an array of the integers from 0 to 83:

np.arange(m)
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
 72 73 74 75 76 77 78 79 80 81 82 83]

The output I am having trouble understanding is the new Y_one_hot. I receive a numpy array with 13 rows (as expected), but I do not understand why the ones are located where they are, given the Y.flatten() input. For example, here is the first of the 13 rows:

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0
  0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0]

Could someone please explain how that single line gets from that input to that output array? The ones seem to be in random positions, and in some of the other 13 rows the number of ones also seems to be random. Is this the intended behavior?

Here is a full runnable example:

import numpy as np
import sys
import re



# turn Y into one hot encoding
Y =  np.array([ 8,  9,  7,  4 , 9,  7,  8,  4,  8,  7,  8, 12,  4,  8,  9,  8, 12,  7,  8,  9,  7, 12,  7,  2,
  9,  7,  8,  7,  2,  0,  7,  8, 12,  2,  0,  8,  8, 12,  7,  0,  8,  6, 12,  7,  2,  8,  6,  5,
  7,  2,  0,  6,  5, 10,  2,  0,  8,  5, 10,  1,  0,  8,  6, 10,  1,  3,  8,  6,  5,  1,  3, 11,
  6,  5, 10,  3, 11,  5, 10,  1, 11, 10,  1,  3])
m = 84
vocab_size = 13
Y_one_hot = np.zeros((vocab_size, m))
Y_one_hot[Y.flatten(), np.arange(m)] = 1
np.set_printoptions(threshold=sys.maxsize)
print(Y_one_hot.astype(int))

Upvotes: 5

Views: 5265

Answers (2)

Ivan

Reputation: 40768

The code you showed is a quick way to convert multiple label indices to one-hot-encodings.

Let's do it with a single index, and convert it to a one-hot-encoding vector. To keep it simple, we will stick with an encoding size of 10 (i.e. nine 0s and one 1):

>>> y = 4
>>> y_ohe = np.zeros(10)
>>> y_ohe[y] = 1
>>> y_ohe
array([0., 0., 0., 0., 1., 0., 0., 0., 0., 0.])

Now, let's try with more than one index: 4 labels at the same time. The starting array would be two-dimensional: (4, 10), i.e. a one-hot-encoding vector of size 10 per label.

>>> y = np.array([4, 2, 1, 7])
>>> y_ohe = np.zeros((4, 10))
>>> y_ohe
array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

The desired result is:

array([[0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0.]])

To do so, we will index by row and by column: np.arange(len(y)) gives us all the row indices, while y gives us the columns where the 1s are supposed to be. Since np.arange(len(y)) and y have the same length, they are paired up element by element, something like:

>>> for i, j in zip(np.arange(len(y)), y):
...     print(i, j)
...
0 4
1 2
2 1
3 7

These are the [i, j] coordinates in the 2D tensor y_ohe where we want 1s to be.

Assign 1 to the indexed positions:

>>> y_ohe[np.arange(len(y)), y] = 1
>>> y_ohe
array([[0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0.]])

Similarly, by indexing the other way around:

>>> y = np.array([4, 2, 1, 7])
>>> y_ohe = np.zeros((10, 4))
>>> y_ohe[y, np.arange(len(y))] = 1
>>> y_ohe
array([[0., 0., 0., 0.],
       [0., 0., 1., 0.],
       [0., 1., 0., 0.],
       [0., 0., 0., 0.],
       [1., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 1.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])
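The two layouts above hold the same information: one is the transpose of the other. A minimal sketch to check this, reusing y from the example (the names rows, by_row and by_col are only illustrative and do not come from the question):

```python
import numpy as np

y = np.array([4, 2, 1, 7])
rows = np.arange(len(y))

# One label per row: shape (4, 10)
by_row = np.zeros((len(y), 10))
by_row[rows, y] = 1

# One label per column: shape (10, 4), the layout used in the question
by_col = np.zeros((10, len(y)))
by_col[y, rows] = 1

print(np.array_equal(by_row.T, by_col))  # True
```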

In your case, Y had an extra dimension, something like Y = np.array([[4], [2], [1], [7]]) in terms of the example I gave above. Flattening it gives y.
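To see why the flatten() call matters for a column-shaped Y, here is a small sketch. It assumes the (4, 1) shape from the example above; whether the tutorial's Y actually has that shape is an assumption, and the names flat, unflat and cols are only illustrative.

```python
import numpy as np

Y = np.array([[4], [2], [1], [7]])  # shape (4, 1), assumed for illustration
cols = np.arange(Y.shape[0])

# With flatten(): index arrays of shape (4,) and (4,) pair up elementwise
flat = np.zeros((10, 4))
flat[Y.flatten(), cols] = 1

# Without flatten(): shapes (4, 1) and (4,) broadcast to (4, 4),
# so every column of rows 4, 2, 1 and 7 is set to 1
unflat = np.zeros((10, 4))
unflat[Y, cols] = 1

print(flat.sum(), unflat.sum())  # 4.0 16.0
```

With the flattened index, exactly one entry per label is set. Without it, the broadcast index fills whole rows, which is no longer a one-hot encoding.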

Upvotes: 3

jakevdp

Reputation: 86513

The line Y_one_hot[Y.flatten(), np.arange(m)] = 1 sets values of the array using arrays of integer indices (documented under Integer Array Indexing).

The arrays of indices are broadcast together, and the result for 1D arrays is essentially an efficient way to do this:

for i, j in zip(Y.flatten(), np.arange(m)):
    Y_one_hot[i, j] = 1

In words, each column of Y_one_hot corresponds to an entry of Y.flatten(), and has a single nonzero value in the row given by the entry.
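Applied to the data from the question, this can be checked directly. The sketch below copies Y from the runnable example and compares the first row of Y_one_hot against the positions where Y equals 0. The checks with np.flatnonzero and np.bincount are an addition for illustration, not part of the original code.

```python
import numpy as np

Y = np.array([
    8, 9, 7, 4, 9, 7, 8, 4, 8, 7, 8, 12, 4, 8, 9, 8, 12, 7, 8, 9, 7, 12, 7, 2,
    9, 7, 8, 7, 2, 0, 7, 8, 12, 2, 0, 8, 8, 12, 7, 0, 8, 6, 12, 7, 2, 8, 6, 5,
    7, 2, 0, 6, 5, 10, 2, 0, 8, 5, 10, 1, 0, 8, 6, 10, 1, 3, 8, 6, 5, 1, 3, 11,
    6, 5, 10, 3, 11, 5, 10, 1, 11, 10, 1, 3,
])
vocab_size, m = 13, Y.size

Y_one_hot = np.zeros((vocab_size, m), dtype=int)
Y_one_hot[Y, np.arange(m)] = 1

# Row 0 has a 1 in exactly the columns j where Y[j] == 0
print(np.flatnonzero(Y_one_hot[0]))  # [29 34 39 50 55 60]
print(np.flatnonzero(Y == 0))        # [29 34 39 50 55 60]

# Every column contains exactly one 1
print((Y_one_hot.sum(axis=0) == 1).all())  # True

# Row k contains as many 1s as the value k occurs in Y
counts = np.bincount(Y, minlength=vocab_size)
print(np.array_equal(Y_one_hot.sum(axis=1), counts))  # True
```

So the ones are not random: row k marks the positions of the value k in Y, and the number of ones in a row is simply how often that value occurs.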

It may be easier to see with a smaller array:

Y_onehot = np.zeros((2, 3), dtype=int)
Y = np.array([0, 1, 0])

Y_onehot[Y.flatten(), np.arange(3)] = 1

print(Y_onehot)
# [[1 0 1]
#  [0 1 0]]

Three entries map to three columns, and each column has a single nonzero entry in the row corresponding to the value.

Upvotes: 2
