Is there a possibility to one-hot encode the characters of a text in TensorFlow or Keras?
tf.one_hot seems to take only integers. tf.keras.preprocessing.text.one_hot seems to one-hot encode sentences into words, but not into characters. Besides that, tf.keras.preprocessing.text.one_hot behaves really strangely, since its response does not actually look one-hot encoded; the following code:
text = "ab bba bbd"
res = tf.keras.preprocessing.text.one_hot(text=text,n=3)
print(res)
leads to this result:
[1, 2, 2]
Every time I run this program, the output is a different three-element vector, sometimes [1, 1, 1] or [2, 1, 1]. The documentation says that uniqueness is not guaranteed, but this seems really senseless to me.
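(For reference, a minimal sketch of what tf.one_hot actually expects: integer indices plus an explicit depth, which is why characters have to be mapped to integers first. The index values below are just illustrative.)
import tensorflow as tf
# tf.one_hot consumes integer indices, not raw characters;
# here 'a' -> 0 and 'b' -> 1 are mapped by hand, with a 3-symbol depth
indices = [0, 1]
print(tf.one_hot(indices, depth=3))
# [[1. 0. 0.]
#  [0. 1. 0.]]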
Upvotes: 3
Views: 3813
I found a nice answer based on pure Python; unfortunately I can no longer find the source. It first converts every char to an int, and then replaces each int with a one-hot array. The mapping is unique over the whole program, and even across programs, as long as the alphabet has the same length and the same order.
# The alphabet of all possible chars you want to convert
# (a space is included so the "hello world" example below works)
alphabet = "abcdefghijklmnopqrstuvwxyz0123456789 "

def convert_to_onehot(data):
    # Creates a dict that maps every char of the alphabet to a unique int based on its position
    char_to_int = dict((c, i) for i, c in enumerate(alphabet))
    # Replaces every char in data with the mapped int
    encoded_data = [char_to_int[char] for char in data]
    print(encoded_data)  # Prints the int-encoded list
    # This part now replaces each int with a one-hot array the size of the alphabet
    one_hot = []
    for value in encoded_data:
        # At first, the whole array is initialized with 0
        letter = [0 for _ in range(len(alphabet))]
        # Only at the position given by the int, a 1 is written
        letter[value] = 1
        one_hot.append(letter)
    return one_hot

print(convert_to_onehot("hello world"))
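If you want to stay inside TensorFlow, the same idea can be expressed with tf.one_hot once the characters are integer-encoded. A minimal sketch, assuming the same alphabet as above (the helper name convert_to_onehot_tf is just illustrative):
import tensorflow as tf

alphabet = "abcdefghijklmnopqrstuvwxyz0123456789 "
char_to_int = {c: i for i, c in enumerate(alphabet)}

def convert_to_onehot_tf(data):
    # Map each character to its index, then let tf.one_hot build the vectors
    indices = [char_to_int[char] for char in data]
    return tf.one_hot(indices, depth=len(alphabet))

print(convert_to_onehot_tf("hello world"))  # tensor of shape (11, 37)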
Upvotes: 3
You can use Keras' to_categorical:
import tensorflow as tf
# define the document
text = 'The quick brown fox jumped over the lazy dog.'
# estimate the size of the vocabulary
words = set(tf.keras.preprocessing.text.text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)
# integer encode the document, then expand the integers to one-hot rows
encoded = tf.keras.preprocessing.text.one_hot(text, round(vocab_size * 1.3))
result = tf.keras.utils.to_categorical(encoded)
print(result)
Result: the intermediate integer encoding produced by one_hot looks like
[1, 2, 3, 4, 5, 6, 1, 7, 8]
(the exact integers vary between runs, because one_hot hashes words with Python's salted built-in hash), and to_categorical then expands each integer into a one-hot row.
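To see the actual one-hot matrix without the hashing nondeterminism, you can feed to_categorical a fixed integer list, e.g. the [1, 2, 2] from the question:
import tensorflow as tf
# Each integer class index becomes a one-hot row
print(tf.keras.utils.to_categorical([1, 2, 2], num_classes=3))
# [[0. 1. 0.]
#  [0. 0. 1.]
#  [0. 0. 1.]]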
Upvotes: 2