Reputation: 2457
I am going to implement an RNN in PyTorch, but first I am having trouble understanding the character-level one-hot encoding that the assignment asks for.
The assignment text is below:
Choose the text you want your neural network to learn, but keep in mind that your data set must be quite large in order to learn the structure! RNNs have been trained on highly diverse texts (novels, song lyrics, Linux Kernel, etc.) with success, so you can get creative. As one easy option, Gutenberg Books is a source of free books where you may download full novels in a .txt format.
We will use a character-level representation for this model. To do this, you may use extended ASCII with 256 characters. As you read your chosen training set, you will read in the characters one at a time into a one-hot-encoding, that is, each character will map to a vector of ones and zeros, where the one indicates which of the characters is present:
char → [0, 0, ..., 1, ..., 0, 0]
Your RNN will read in these length-256 binary vectors as input.
For example, I have read a novel into Python. It contains 97 unique characters and roughly 300,000 characters in total.
So will my input be a 97 x 256 one-hot encoded matrix?
Or will it be a 300,000 x 256 one-hot encoded matrix?
Upvotes: 0
Views: 164
Reputation: 567
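One-hot encoding means each vector contains a single 1, and the position of that 1 is different for every distinct character. So if you have 97 unique characters, I think you should use one-hot vectors of size 98 (97 + 1). The extra slot is for unknown characters, which all map to the same vector. But you can also use length-256 vectors, as the assignment suggests. So your input will be:
B x N x V (B = batch size, N = number of characters in the sequence, V = one-hot vector size). Every character of the text gets its own row, so the encoding has one row per character read, not one row per unique character.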
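To make the N x V shape concrete, here is a minimal sketch in plain Python. The helper names `build_vocab` and `one_hot_encode` and the sample string are my own, not from the assignment, and reserving index 0 for unknown characters is just one possible convention for the "+ 1" slot mentioned above.

```python
def build_vocab(text):
    """Map each unique character to an index; index 0 is reserved for unknowns."""
    chars = sorted(set(text))
    return {ch: i + 1 for i, ch in enumerate(chars)}


def one_hot_encode(text, vocab):
    """Return one row per character, each row a one-hot list of length V."""
    size = len(vocab) + 1  # +1 for the unknown-character slot
    rows = []
    for ch in text:
        row = [0] * size
        row[vocab.get(ch, 0)] = 1  # unseen characters fall back to index 0
        rows.append(row)
    return rows


text = "hello world"          # stand-in for the novel
vocab = build_vocab(text)     # 8 unique characters
encoded = one_hot_encode(text, vocab)

# N rows (one per character), V columns (vocabulary size + 1)
print(len(encoded), len(encoded[0]))  # prints "11 9"
```

With the novel in place of the sample string, the same code would give about 300,000 rows. Using the fixed 256-character table from the assignment would only change the number of columns.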
But if you are using a library, it will usually ask for the index of each character in the vocabulary and handle the index-to-one-hot conversion for you. Hope that helps.
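As a rough illustration of the index-based route in PyTorch, the sketch below converts characters to indices, expands them with `torch.nn.functional.one_hot`, and feeds one batch to `torch.nn.RNN`. The sample text, the hidden size of 16, and the single-sequence batch are assumptions chosen only to keep the example small.

```python
import torch
import torch.nn.functional as F

text = "hello world"  # stand-in for the novel
vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}

# Character indices, shape (N,)
indices = torch.tensor([vocab[ch] for ch in text], dtype=torch.long)

# One-hot vectors, shape (N, V); cast to float because nn.RNN expects floats
one_hot = F.one_hot(indices, num_classes=len(vocab)).float()

# Add a batch dimension: shape (B, N, V) with B = 1
batch = one_hot.unsqueeze(0)

# batch_first=True makes the RNN accept (B, N, V) input
rnn = torch.nn.RNN(input_size=len(vocab), hidden_size=16, batch_first=True)
output, hidden = rnn(batch)

print(batch.shape)   # torch.Size([1, 11, 8])
print(output.shape)  # torch.Size([1, 11, 16])
```

In practice a long text is usually cut into shorter sequences of N characters and several of them are stacked along the batch dimension, rather than passing all 300,000 characters as one sequence.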
Upvotes: 1