Is there a possibility to one-hot encode the characters of a text in TensorFlow or Keras?
tf.one_hot seems to take only integers. tf.keras.preprocessing.text.one_hot seems to one-hot encode sentences into words, but not into characters. Besides that, tf.keras.preprocessing.text.one_hot behaves really strangely, since its response does not actually look one-hot encoded; the following code:
text = "ab bba bbd"
res = tf.keras.preprocessing.text.one_hot(text=text,n=3)
print(res)
leads to this result:
[1, 2, 2]
Every time I run this program, the output is a different three-element vector, sometimes [1, 1, 1] or [2, 1, 1]. The documentation says that uniqueness is not guaranteed, but this seems really senseless to me.
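(For reference, a minimal sketch of what tf.one_hot actually expects: integer indices plus an explicit depth, which is why characters have to be mapped to integers first. The index values below are just illustrative.)
import tensorflow as tf
# tf.one_hot consumes integer indices, not raw characters;
# here 'a' -> 0 and 'b' -> 1 are mapped by hand, with a 3-symbol depth
indices = [0, 1]
print(tf.one_hot(indices, depth=3))
# [[1. 0. 0.]
#  [0. 1. 0.]]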
Upvotes: 3
Views: 3813
I found a nice answer based on pure Python; unfortunately I can no longer find the source. It first converts every char to an int, and then replaces each int with a one-hot array. The mapping is unique over the whole program, and even across programs, as long as the alphabet has the same length and the same order.
# The alphabet of all possible chars you want to convert
# (a space is included so the "hello world" example below works)
alphabet = "abcdefghijklmnopqrstuvwxyz0123456789 "

def convert_to_onehot(data):
    # Creates a dict that maps every char of the alphabet to a unique int based on its position
    char_to_int = dict((c, i) for i, c in enumerate(alphabet))
    # Replaces every char in data with the mapped int
    encoded_data = [char_to_int[char] for char in data]
    print(encoded_data)  # Prints the int-encoded list
    # This part now replaces each int with a one-hot array the size of the alphabet
    one_hot = []
    for value in encoded_data:
        # At first, the whole array is initialized with 0
        letter = [0 for _ in range(len(alphabet))]
        # Only at the position given by the int, a 1 is written
        letter[value] = 1
        one_hot.append(letter)
    return one_hot

print(convert_to_onehot("hello world"))
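If you want to stay inside TensorFlow, the same idea can be expressed with tf.one_hot once the characters are integer-encoded. A minimal sketch, assuming the same alphabet as above (the helper name convert_to_onehot_tf is just illustrative):
import tensorflow as tf

alphabet = "abcdefghijklmnopqrstuvwxyz0123456789 "
char_to_int = {c: i for i, c in enumerate(alphabet)}

def convert_to_onehot_tf(data):
    # Map each character to its index, then let tf.one_hot build the vectors
    indices = [char_to_int[char] for char in data]
    return tf.one_hot(indices, depth=len(alphabet))

print(convert_to_onehot_tf("hello world"))  # tensor of shape (11, 37)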
Upvotes: 3
You can use Keras' to_categorical:
import tensorflow as tf
# define the document
text = 'The quick brown fox jumped over the lazy dog.'
# estimate the size of the vocabulary
words = set(tf.keras.preprocessing.text.text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)
# integer encode the document, then expand the integers to one-hot rows
encoded = tf.keras.preprocessing.text.one_hot(text, round(vocab_size * 1.3))
result = tf.keras.utils.to_categorical(encoded)
print(result)
Result: the intermediate integer encoding produced by one_hot looks like
[1, 2, 3, 4, 5, 6, 1, 7, 8]
(the exact integers vary between runs, because one_hot hashes words with Python's salted built-in hash), and to_categorical then expands each integer into a one-hot row.
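To see the actual one-hot matrix without the hashing nondeterminism, you can feed to_categorical a fixed integer list, e.g. the [1, 2, 2] from the question:
import tensorflow as tf
# Each integer class index becomes a one-hot row
print(tf.keras.utils.to_categorical([1, 2, 2], num_classes=3))
# [[0. 1. 0.]
#  [0. 0. 1.]
#  [0. 0. 1.]]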
Upvotes: 2