Reputation: 3
I am looking at code for an RNN Language Model. I am confused about 1) how the training pairs (x, y) are constructed and, subsequently, 2) how the loss is computed. The code borrows from the TensorFlow RNN tutorial (the reader module).
Within the reader module, a generator, ptb_iterator, is defined. It takes in the data as one sequence and yields (x, y) pairs according to the batch size and the number of steps over which you wish to 'unroll' the RNN. It is best to look at the entire definition first, but the part that confused me is this:
for i in range(epoch_size):
    # Take num_steps consecutive columns from each row of the batched data.
    x = data[:, i*num_steps:(i+1)*num_steps]
    # The targets are the same slice shifted one position to the right.
    y = data[:, i*num_steps+1:(i+1)*num_steps+1]
    yield (x, y)
which is documented as:
Yields:
    Pairs of the batched data, each a matrix of shape [batch_size, num_steps].
    The second element of the tuple is the same data time-shifted to the
    right by one.
So if I understand correctly, for the data sequence [1 2 3 4 5 6] and num_steps = 2, then for stochastic gradient descent (i.e. batch_size = 1) the following pairs will be generated:

x = [1, 2], y = [2, 3]
x = [3, 4], y = [4, 5]
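
To double-check, here is a minimal, self-contained paraphrase of the generator (ptb_iterator_sketch is my own name, and I have condensed the reshaping, so treat it as a sketch of the reader module rather than its exact code); it reproduces exactly those pairs:

import numpy as np

def ptb_iterator_sketch(raw_data, batch_size, num_steps):
    # Condensed paraphrase of reader.ptb_iterator, for checking shapes.
    raw_data = np.array(raw_data, dtype=np.int32)
    batch_len = len(raw_data) // batch_size
    # Cut the flat token sequence into batch_size parallel streams.
    data = raw_data[:batch_size * batch_len].reshape(batch_size, batch_len)
    epoch_size = (batch_len - 1) // num_steps
    for i in range(epoch_size):
        x = data[:, i * num_steps:(i + 1) * num_steps]
        y = data[:, i * num_steps + 1:(i + 1) * num_steps + 1]
        yield (x, y)

for x, y in ptb_iterator_sketch([1, 2, 3, 4, 5, 6], batch_size=1, num_steps=2):
    print(x, y)
# [[1 2]] [[2 3]]
# [[3 4]] [[4 5]]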
1) Is this the correct way to do this? Should it not be done so that the pairs overlap, i.e.:

x = [1, 2], y = [2, 3]
x = [2, 3], y = [3, 4]
x = [3, 4], y = [4, 5]
x = [4, 5], y = [5, 6]

or some similar scheme, so that every prediction is made with the full history behind it?
2) Lastly, given that the pairs are generated as they are in the reader module, when it comes to training, won't the computed loss reflect the RNN's performance over a varying number of unrolled steps rather than over the num_steps specified?

For example, the model will make a prediction for x=3 (from x=[3,4]) without considering that 2 came before it (i.e. the RNN is unrolled one step instead of two).
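
As far as I can tell, the per-chunk loss is just the average cross-entropy over all num_steps target positions; here is a toy numpy stand-in (my own sketch, not the tutorial's code) of that computation:

import numpy as np

def chunk_loss_sketch(logits, targets):
    # logits: [num_steps, vocab_size]; targets: [num_steps] integer token ids.
    # Numerically stable softmax over the vocabulary axis.
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    # Every position contributes equally to the average, including position 0,
    # which was predicted from a single token of context within the chunk.
    return -np.log(probs[np.arange(len(targets)), targets]).mean()

print(chunk_loss_sketch(np.random.randn(2, 10), np.array([3, 7])))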
Upvotes: 0
Views: 116
Reputation: 5206
Re (1), the goal is for the sequence size to be much bigger than 2, and then you don't want to replicate your entire dataset N times, as you don't gain much statistical power. Re (2), it's an approximation for use at training time; at prediction time you should predict with the entire sequence.
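
To make the contrast concrete, here is a toy sketch (illustrative names only, not tutorial code): training scores fixed num_steps chunks, while prediction runs over the whole sequence, carrying the hidden state forward:

import numpy as np

rng = np.random.default_rng(0)
H = 4  # hidden size for a toy RNN cell (a stand-in for the tutorial's LSTM)
W_h = rng.normal(scale=0.1, size=(H, H))
W_x = rng.normal(scale=0.1, size=(H, H))

def step(state, x_t):
    return np.tanh(state @ W_h + x_t @ W_x)

sequence = rng.normal(size=(6, H))  # stand-in embeddings for tokens 1..6
num_steps = 2

# Training-time approximation: the sequence is cut into num_steps chunks,
# and gradients never flow across a chunk boundary.
for start in range(0, len(sequence) - 1, num_steps):
    state = np.zeros(H)  # each chunk starts fresh in this simple sketch
    for x_t in sequence[start:start + num_steps]:
        state = step(state, x_t)

# Prediction time: run the entire sequence, carrying the state forward,
# so every position is conditioned on its full history.
state = np.zeros(H)
for x_t in sequence:
    state = step(state, x_t)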
Upvotes: 1