JJson
JJson

Reputation: 233

Generate one hot encodings from dict values

I was trying to make a one hot array based on my dictionary characters: First, I created a numpy zeros that has row X column (3x7) and then I search for the id of each character and assign "1" to each row of the numpy array.

My goal is to assign each character with one hot array. "1" as "present" and "0" as "not present". Here we have 3 characters so we should have 3 rows, while the 7 columns serve as the characters existence in the dictionary.

However, I received an error stating that "TypeError: only integer scalar arrays can be converted to a scalar index". Can anyone please help me in this? Thank you

In order not to make everyone misunderstand my dictionary:

Here is how I create the dic:

sent = ["a", "b", "c", "d", "e", "f", "g"]
aaa = len(sent)
aa = {x:i for i,x in enumerate(sent)}

My code:

import numpy as np
sentences = ["b", "c", "e"]
a = {}
for xx in sentences:
   a[xx] = aa[xx]
a = {"b":1, "c":2, "e":4}
aa =len(a)

for x,y in a.items():
    aa = np.zeros((aa,aaa))
    aa[y] = 1

print(aa)

Current Error:

TypeError: only integer scalar arrays can be converted to a scalar index

My expected output:

[[0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0.]]

-------> Since its dictionary so the index arrangement should be different and the "1"s within the array is a dummy so that I can show my expected output.

Upvotes: 1

Views: 6630

Answers (5)

Him
Him

Reputation: 5551

I like to use a LabelEncoder with a OneHotEncoder from sklearn.

import sklearn.preprocessing
import numpy as np

texty_data = np.array(["a", "c", "b"])
le = sklearn.preprocessing.LabelEncoder().fit(texty_data)
integery_data = le.transform(texty_data)
ohe = sklearn.preprocessing.OneHotEncoder().fit(integery_data.reshape((-1,1)))
onehot_data = ohe.transform(integery_data.reshape((-1,1)))

Stores it sparse, so that's handy. You can also use a LabelBinarizer to streamline this:

import sklearn.preprocessing
import numpy as np

texty_data = np.array(["a", "c", "b"])
lb = sklearn.preprocessing.LabelBinarizer().fit(texty_data)
onehot_data = lb.transform(texty_data)
print(onehot_data, lb.inverse_transform(onehot_data))

Upvotes: 0

JJson
JJson

Reputation: 233

Here is another one by using sklearn.preprocessing

The lines are quite long and not much difference. I don:t know why but produced a similar results.

import numpy as np
from sklearn.preprocessing import OneHotEncoder
sent = ["a", "b", "c", "d", "e", "f", "g"]
aaa = len(sent)
aa = {x:i for i,x in enumerate(sent)}


sentences = ["b", "c", "e"]
a = {}
for xx in sentences:
   a[xx] = aa[xx]
a = {"a":0, "b":1, "c":2, "d":3, "e":4, "f":5, "g":6}
aa =len(a)

index = []
for x,y in a.items():
    index.append([y])

index = np.asarray(index)

enc = OneHotEncoder()
enc.fit(index)

print(enc.transform([[1], [2], [4]]).toarray())

Output

[[0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0.]]

Upvotes: 1

JJson
JJson

Reputation: 233

My solution and for future readers:

I build the dictionary for the "sent" list:

sent = ["a", "b", "c", "d", "e", "f", "g"]
aaa = len(sent)
aa = {x:i for i,x in enumerate(sent)}

Then I find the indices for my own sentences based on the dictionary and assigned the numerical values to these sentences.

import numpy as np
sentences = ["b", "c", "e"]
a = {}
for xx in sentences:
   a[xx] = aa[xx]
a = {"b":1, "c":2, "e":4}
aa =len(a)

I extract the indices from the new assignment of "a":

index = []
for x,y in a.items():
    index.append(y)

Then I create another numpy array for these extract indices from the a.

index = np.asarray(index)

Now I create numpy zeros to store the existence of each character:

new = np.zeros((aa,aaa))
new[np.arange(aa), index] = 1

print(new)

Output:

[[0. 1. 0. 0. 0. 0. 0.]
[0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0.]]

Upvotes: 1

Mark
Mark

Reputation: 92450

A one hot encoding treats a sample as a sequence, where each element of the sequence is the index into a vocabulary indicating whether that element (like a word or letter) is in the sample. For example if your vocabulary was the lower-case alphabet, a one-hot encoding of the work cat might look like:

 [1, 0., 1, 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 1, 0., 0., 0., 0., 0., 0.]

Indicating that this word contains the letters c, a, and t.

To make a one-hot encoding you need two things a vocabulary lookup with all the possible values (when using words this is why the matrices can get so large because the vocabulary is huge!). But if encoding the lower-case alphabet you only need 26.

Then you typically represent your samples as indexes in the vocabulary. So the set of words might look like this:

#bag, cab, fad
sentences = np.array([[1, 0, 6], [2, 0, 1], [5, 0, 3]])

When you one-hot encode that you will get a matrix 3 x 26:

vocab = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

#bag, cab, fad
sentences = np.array([[1, 0, 6], [2, 0, 1], [5, 0, 3]])

def onHot(sequences, dimension=len(vocab)):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
      results[i, sequence] = 1
    return results

onHot(sentences)

Which results in thee one-hot encoded samples with a 26 letter vocabulary ready to be fed to a neural network:

array([[1., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
   [1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
   [1., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

Upvotes: 2

cs95
cs95

Reputation: 402813

Setting indices

(Comments inlined.)

# Sort and extract the indices.
idx = sorted(a.values())
# Initialise a matrix of zeros.
aa = np.zeros((len(idx), max(idx) + 1))
# Assign 1 to appropriate indices.
aa[np.arange(len(aa)), idx] = 1

print (aa)
array([[0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 1.]])

numpy.eye

idx = sorted(a.values())
eye = np.eye(max(idx) + 1)    
aa = eye[idx]

print (aa)
array([[0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 1.]])

Upvotes: 6

Related Questions