ascar

Reputation: 3

Tensorflow: Recurrent neural network training pairs & the effect on the loss function

I am looking at code for an RNN language model. I am confused as to 1) how the training pairs (x, y) are constructed and subsequently 2) how the loss is computed. The code borrows from the TensorFlow RNN tutorial (reader module).

Within the reader module, a generator, ptb_iterator, is defined. It takes in the data as one sequence and yields x, y pairs in accordance with the batch size and the number of steps you wish to 'unroll' the RNN. It is best to look at the entire definition first, but the part that confused me is this:

for i in range(epoch_size):
  x = data[:, i*num_steps:(i+1)*num_steps]
  y = data[:, i*num_steps+1:(i+1)*num_steps+1]
  yield (x, y)
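For reference, here is a minimal plain-Python sketch of what the surrounding ptb_iterator does (the real reader uses NumPy arrays and TensorFlow ops; the batching and slicing logic is the same):

```python
def ptb_iterator(raw_data, batch_size, num_steps):
    # Split the flat token sequence into batch_size rows,
    # then slide a window of num_steps along the time axis.
    data_len = len(raw_data)
    batch_len = data_len // batch_size
    data = [raw_data[b * batch_len:(b + 1) * batch_len]
            for b in range(batch_size)]
    epoch_size = (batch_len - 1) // num_steps
    for i in range(epoch_size):
        x = [row[i * num_steps:(i + 1) * num_steps] for row in data]
        # y is the same data shifted right by one token
        y = [row[i * num_steps + 1:(i + 1) * num_steps + 1] for row in data]
        yield (x, y)

for x, y in ptb_iterator([1, 2, 3, 4, 5, 6], batch_size=1, num_steps=2):
    print(x, y)
```

Running this on the example sequence [1 2 3 4 5 6] with batch_size=1 and num_steps=2 yields the pairs x=[1,2], y=[2,3] and x=[3,4], y=[4,5]: the windows step forward by num_steps, so consecutive x chunks do not overlap.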

which is documented as:

*Yields:
 Pairs of the batched data, each a matrix of shape [batch_size, num_steps].
 The second element of the tuple is the same data time-shifted to the
 right by one.*

So if I understand correctly, for the data sequence [1 2 3 4 5 6] and num_steps = 2, then for stochastic gradient descent (i.e. batch_size=1) the following pairs will be generated:

  1. x=[1,2] , y=[2,3]
  2. x=[3,4] , y=[4,5]

1) Is this the correct way to do this? Should it not be done so that the pairs are:

  1. x=[1,2] , y=[2,3]
  2. x=[2,3] , y=[3,4] ... # allows for more datapoints

OR

  1. x=[1,2] , y=[3]
  2. x=[2,3] , y=[4] ... # ensures that all predictions are made with context length = num_steps
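The two alternatives described above can be sketched as generators (hypothetical names; these are not in the reader module). Both slide the window by one position instead of by num_steps:

```python
def shifted_pairs(seq, num_steps):
    # Alternative 1: overlapping windows with stride 1, y shifted
    # right by one -- yields more datapoints than the stride-num_steps
    # scheme in ptb_iterator.
    for i in range(len(seq) - num_steps):
        yield seq[i:i + num_steps], seq[i + 1:i + num_steps + 1]

def fixed_context_pairs(seq, num_steps):
    # Alternative 2: predict only the single token after each window,
    # so every prediction is made with exactly num_steps of context.
    for i in range(len(seq) - num_steps):
        yield seq[i:i + num_steps], [seq[i + num_steps]]
```

On [1 2 3 4 5 6] with num_steps=2, the first yields (x=[1,2], y=[2,3]), (x=[2,3], y=[3,4]), and so on; the second yields (x=[1,2], y=[3]), (x=[2,3], y=[4]), and so on.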

2) Lastly, given that the pairs are generated as they are in the reader module, when it comes to training, will the loss computed not reflect the RNN's performance over a range of unrolled steps instead of num_steps specified?

For example, the model will make a prediction for x=3 (from x=[3,4]) without considering that 2 came before it (i.e. unrolling the RNN one step instead of two).

Upvotes: 0

Views: 116

Answers (1)

Alexandre Passos

Reputation: 5206

Re (1), the goal is for the sequence size to be much bigger than 2, and then you don't want to replicate your entire dataset N times, as you don't gain much statistical power. Re (2), it's an approximation used at training time; at prediction time you should predict with the entire sequence.
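One detail worth adding about the training-time approximation: in truncated backpropagation through time, the final hidden state of one chunk is typically fed in as the initial state of the next, so the step that predicts from x=3 can still condition on the earlier 2 through the carried state, even though gradients stop at the chunk boundary. A toy sketch (plain Python, with a hypothetical `step` cell that just accumulates the tokens it has seen, standing in for a real recurrent update):

```python
def step(state, token):
    # Toy recurrent cell: the state "summarizes" everything seen so
    # far (here literally, as a tuple of tokens).
    return state + (token,)

def run_chunked(chunks):
    # Process chunks in order, carrying the final state of one chunk
    # into the next, as truncated BPTT training loops typically do.
    state = ()
    for chunk in chunks:
        for token in chunk:
            state = step(state, token)
    return state

# The second chunk [3, 4] is processed with a state that already saw
# [1, 2], so the step consuming token 3 still has the full history.
print(run_chunked([[1, 2], [3, 4]]))
```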

Upvotes: 1
