Reputation: 233
I was trying to make a one-hot array based on my dictionary of characters. First, I created a NumPy array of zeros with 3 rows and 7 columns (3x7), and then I looked up the id of each character and assigned "1" in the corresponding row of the array.
My goal is to assign each character a one-hot array: "1" means "present" and "0" means "not present". Here we have 3 characters, so we should have 3 rows, while the 7 columns mark each character's position in the dictionary.
However, I received an error: "TypeError: only integer scalar arrays can be converted to a scalar index". Can anyone please help me with this? Thank you.
To avoid any misunderstanding about my dictionary, here is how I create it:
sent = ["a", "b", "c", "d", "e", "f", "g"]
aaa = len(sent)
aa = {x:i for i,x in enumerate(sent)}
My code:
import numpy as np
sentences = ["b", "c", "e"]
a = {}
for xx in sentences:
    a[xx] = aa[xx]
a = {"b":1, "c":2, "e":4}
aa = len(a)
for x,y in a.items():
    aa = np.zeros((aa,aaa))
    aa[y] = 1
print(aa)
Current Error:
TypeError: only integer scalar arrays can be converted to a scalar index
My expected output:
[[0. 1. 0. 0. 0. 0. 0.]
[0. 0. 1. 0. 0. 0. 0.]
[0. 0. 0. 0. 1. 0. 0.]]
-------> Note: since this comes from a dictionary, the row order may differ; the "1"s in the array above are placeholders so that I can show my expected output.
Upvotes: 1
Views: 6630
Reputation: 5551
I like to use a LabelEncoder with a OneHotEncoder from sklearn.
import sklearn.preprocessing
import numpy as np

texty_data = np.array(["a", "c", "b"])
# Map each string label to an integer id.
le = sklearn.preprocessing.LabelEncoder().fit(texty_data)
integery_data = le.transform(texty_data)
# One-hot encode the integer ids (reshaped into a single-feature column).
ohe = sklearn.preprocessing.OneHotEncoder().fit(integery_data.reshape((-1,1)))
onehot_data = ohe.transform(integery_data.reshape((-1,1)))
It stores the result as a sparse matrix, which is handy. You can also use a LabelBinarizer to streamline this:
import sklearn.preprocessing
import numpy as np
texty_data = np.array(["a", "c", "b"])
lb = sklearn.preprocessing.LabelBinarizer().fit(texty_data)
onehot_data = lb.transform(texty_data)
print(onehot_data, lb.inverse_transform(onehot_data))
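Note: in newer scikit-learn versions (0.20 and later), OneHotEncoder can handle string categories directly, so the LabelEncoder step can be skipped. A minimal sketch, assuming a recent sklearn:
import sklearn.preprocessing
import numpy as np

texty_data = np.array(["a", "c", "b"])
# OneHotEncoder in scikit-learn >= 0.20 accepts strings directly.
ohe = sklearn.preprocessing.OneHotEncoder().fit(texty_data.reshape((-1, 1)))
onehot_data = ohe.transform(texty_data.reshape((-1, 1)))  # sparse 3x3 matrix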
Upvotes: 0
Reputation: 233
Here is another approach using sklearn.preprocessing.
The code is a bit longer and not much different. I don't know why, but it produces similar results.
import numpy as np
from sklearn.preprocessing import OneHotEncoder
sent = ["a", "b", "c", "d", "e", "f", "g"]
aaa = len(sent)
aa = {x:i for i,x in enumerate(sent)}
sentences = ["b", "c", "e"]
a = {}
for xx in sentences:
    a[xx] = aa[xx]
a = {"a":0, "b":1, "c":2, "d":3, "e":4, "f":5, "g":6}
aa = len(a)
index = []
for x,y in a.items():
    index.append([y])
index = np.asarray(index)
enc = OneHotEncoder()
enc.fit(index)
print(enc.transform([[1], [2], [4]]).toarray())
Output
[[0. 1. 0. 0. 0. 0. 0.]
[0. 0. 1. 0. 0. 0. 0.]
[0. 0. 0. 0. 1. 0. 0.]]
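The reason this gives all seven columns is that the encoder was fitted on every dictionary index (0 through 6), not just the three characters being encoded. A quick sanity check, a sketch using the fitted encoder above:
# Any single index transforms to a row with all 7 learned columns:
print(enc.transform([[0]]).toarray().shape)  # (1, 7)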
Upvotes: 1
Reputation: 233
My solution and for future readers:
I build the dictionary for the "sent" list:
sent = ["a", "b", "c", "d", "e", "f", "g"]
aaa = len(sent)
aa = {x:i for i,x in enumerate(sent)}
Then I find the indices of my own sentences based on the dictionary and assign those numerical values to the characters:
import numpy as np
sentences = ["b", "c", "e"]
a = {}
for xx in sentences:
    a[xx] = aa[xx]
a = {"b":1, "c":2, "e":4}
aa = len(a)
Next, I extract the indices from the newly assigned "a":
index = []
for x,y in a.items():
    index.append(y)
Then I convert these extracted indices into a NumPy array:
index = np.asarray(index)
Now I create a NumPy array of zeros to record the presence of each character:
new = np.zeros((aa,aaa))
new[np.arange(aa), index] = 1
print(new)
Output:
[[0. 1. 0. 0. 0. 0. 0.]
[0. 0. 1. 0. 0. 0. 0.]
[0. 0. 0. 0. 1. 0. 0.]]
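For future readers, the key step is NumPy's integer array indexing: when you pass one index array per axis, the arrays are paired element-wise, so each row gets exactly one column set to 1. A minimal standalone sketch:
import numpy as np

m = np.zeros((3, 7))
# Pairs (0, 1), (1, 2) and (2, 4) are set in a single operation:
m[np.array([0, 1, 2]), np.array([1, 2, 4])] = 1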
Upvotes: 1
Reputation: 92450
A one-hot encoding treats a sample as a sequence, where each element of the sequence is an index into a vocabulary, indicating whether that element (like a word or letter) is in the sample. For example, if your vocabulary were the lower-case alphabet, a one-hot encoding of the word cat might look like:
[1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.]
indicating that this word contains the letters c, a, and t.
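A minimal sketch of how such a vector could be built by hand, assuming the vocabulary is just the lower-case alphabet (so the index is the letter's position in the alphabet):
import numpy as np

word = "cat"
vec = np.zeros(26)
vec[[ord(ch) - ord("a") for ch in word]] = 1  # sets indices 2, 0 and 19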
To make a one-hot encoding you need two things. First, a vocabulary lookup with all the possible values (when using words, this is why the matrices can get so large: the vocabulary is huge!), though if encoding the lower-case alphabet you only need 26 entries. Second, you typically represent your samples as indices into the vocabulary. So the set of words might look like this:
#bag, cab, fad
sentences = np.array([[1, 0, 6], [2, 0, 1], [5, 0, 3]])
When you one-hot encode that, you get a 3 x 26 matrix:
vocab = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

#bag, cab, fad
sentences = np.array([[1, 0, 6], [2, 0, 1], [5, 0, 3]])

def onHot(sequences, dimension=len(vocab)):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1
    return results

onHot(sentences)
This results in three one-hot encoded samples with a 26-letter vocabulary, ready to be fed to a neural network:
array([[1., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[1., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
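To sanity-check the encoding, a row can be inverted back to the letters it contains (a small sketch using the vocab list above):
# Recover the letters present in the first sample ("bag"):
[vocab[i] for i in np.flatnonzero(onHot(sentences)[0])]  # ['a', 'b', 'g']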
Upvotes: 2
Reputation: 402813
(Comments inlined.)
# Sort and extract the indices.
idx = sorted(a.values())
# Initialise a matrix of zeros.
aa = np.zeros((len(idx), max(idx) + 1))
# Assign 1 to appropriate indices.
aa[np.arange(len(aa)), idx] = 1
print(aa)
array([[0., 1., 0., 0., 0.],
[0., 0., 1., 0., 0.],
[0., 0., 0., 0., 1.]])
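Note that this produces only max(idx) + 1 = 5 columns. To get the question's full 7-column matrix, use the vocabulary size instead; a sketch assuming aaa = 7 from the question's code:
aa = np.zeros((len(idx), aaa))  # aaa columns, one per dictionary entry
aa[np.arange(len(aa)), idx] = 1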
Alternatively, you can use numpy.eye:
idx = sorted(a.values())
eye = np.eye(max(idx) + 1)
aa = eye[idx]
print(aa)
array([[0., 1., 0., 0., 0.],
[0., 0., 1., 0., 0.],
[0., 0., 0., 0., 1.]])
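This works because row i of the identity matrix is exactly the one-hot vector for index i, so indexing with a list of indices stacks the corresponding rows. For example:
np.eye(5)[[1, 2, 4]]  # stacks rows 1, 2 and 4 of the 5x5 identity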
Upvotes: 6