Charlie Parker

Reputation: 5291

Is hidden and output the same for a GRU unit in Pytorch?

I do understand conceptually what an LSTM or GRU should do (thanks to this question: What's the difference between "hidden" and "output" in PyTorch LSTM?), BUT when I inspect the output of the GRU, h_n and output are NOT the same, while they should be...

(Pdb) rnn_output
tensor([[[ 0.2663,  0.3429, -0.0415,  ...,  0.1275,  0.0719,  0.1011],
         [-0.1272,  0.3096, -0.0403,  ...,  0.0589, -0.0556, -0.3039],
         [ 0.1064,  0.2810, -0.1858,  ...,  0.3308,  0.1150, -0.3348],
         ...,
         [-0.0929,  0.2826, -0.0554,  ...,  0.0176, -0.1552, -0.0427],
         [-0.0849,  0.3395, -0.0477,  ...,  0.0172, -0.1429,  0.0153],
         [-0.0212,  0.1257, -0.2670,  ..., -0.0432,  0.2122, -0.1797]]],
       grad_fn=<StackBackward>)
(Pdb) hidden
tensor([[[ 0.1700,  0.2388, -0.4159,  ..., -0.1949,  0.0692, -0.0630],
         [ 0.1304,  0.0426, -0.2874,  ...,  0.0882,  0.1394, -0.1899],
         [-0.0071,  0.1512, -0.1558,  ..., -0.1578,  0.1990, -0.2468],
         ...,
         [ 0.0856,  0.0962, -0.0985,  ...,  0.0081,  0.0906, -0.1234],
         [ 0.1773,  0.2808, -0.0300,  ..., -0.0415, -0.0650, -0.0010],
         [ 0.2207,  0.3573, -0.2493,  ..., -0.2371,  0.1349, -0.2982]],

        [[ 0.2663,  0.3429, -0.0415,  ...,  0.1275,  0.0719,  0.1011],
         [-0.1272,  0.3096, -0.0403,  ...,  0.0589, -0.0556, -0.3039],
         [ 0.1064,  0.2810, -0.1858,  ...,  0.3308,  0.1150, -0.3348],
         ...,
         [-0.0929,  0.2826, -0.0554,  ...,  0.0176, -0.1552, -0.0427],
         [-0.0849,  0.3395, -0.0477,  ...,  0.0172, -0.1429,  0.0153],
         [-0.0212,  0.1257, -0.2670,  ..., -0.0432,  0.2122, -0.1797]]],
       grad_fn=<StackBackward>)

They look like some transpose of each other... why?
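
For reference, a minimal sketch that seems to reproduce the shapes above (the sizes are guesses inferred from the printed tensors: 2 layers, a batch of 6, and a single timestep):

import torch
import torch.nn as nn

# hypothetical sizes, chosen only to match the printed shapes
gru = nn.GRU(input_size = 10, hidden_size = 8, num_layers = 2)  # batch_first = False (default)
x = torch.randn(1, 6, 10)     # (seq_len = 1, batch = 6, input_size = 10)
rnn_output, hidden = gru(x)

rnn_output.shape
torch.Size([1, 6, 8])
hidden.shape
torch.Size([2, 6, 8])

# with a single timestep, the last layer's block of hidden matches rnn_output
torch.equal(rnn_output[0], hidden[-1])
True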

Upvotes: 1

Views: 7629

Answers (3)

ndrwnaguib

Reputation: 6115

They are not really the same. Consider the following unidirectional GRU model:

import torch.nn as nn
import torch

gru = nn.GRU(input_size = 8, hidden_size = 50, num_layers = 3, batch_first = True)

Please make sure you observe the input shape carefully.

inp = torch.randn(1024, 112, 8)
out, hn = gru(inp)

Unsurprisingly,

torch.equal(out, hn)
False

One of the most effective ways to understand the output vs. the hidden states is to view hn as hn.view(num_layers, num_directions, batch, hidden_size), where num_directions = 2 for bidirectional recurrent networks (and 1 otherwise, i.e., our case). Thus,

hn_conceptual_view = hn.view(3, 1, 1024, 50)

As the docs state:

h_n of shape (num_layers * num_directions, batch, hidden_size): tensor containing the hidden state for t = seq_len (i.e., for the last timestep)

In our case, this contains the hidden vectors for the timestep t = 112, whereas the docs describe output as:

output of shape (seq_len, batch, num_directions * hidden_size): tensor containing the output features h_t from the last layer of the GRU, for each t. If a torch.nn.utils.rnn.PackedSequence has been given as the input, the output will also be a packed sequence. For the unpacked case, the directions can be separated using output.view(seq_len, batch, num_directions, hidden_size), with forward and backward being direction 0 and 1 respectively.

So, consequently, one can do:

torch.equal(out[:, -1], hn_conceptual_view[-1, 0, :, :])
True

Explanation: I compare the last timestep of every sequence in the batch, out[:, -1], to the last layer's hidden vectors, hn_conceptual_view[-1, 0, :, :].


For a bidirectional GRU (read the unidirectional case above first):

gru = nn.GRU(input_size = 8, hidden_size = 50, num_layers = 3, batch_first = True, bidirectional = True)
inp = torch.randn(1024, 112, 8)
out, hn = gru(inp)

The view changes (since we now have two directions):

hn_conceptual_view = hn.view(3, 2, 1024, 50)

If you try the same comparison as before:

torch.equal(out[:, -1], hn_conceptual_view[-1, 0, :, :])
False

Explanation: this fails because we are comparing tensors of different shapes;

out[:, -1].shape
torch.Size([1024, 100])
hn_conceptual_view[-1, 0, :, :].shape
torch.Size([1024, 50])

Remember that for bidirectional networks, the hidden states of both directions are concatenated at each time step: the first hidden_size elements (i.e., out[:, 0, :50]) are the hidden states of the forward network, and the remaining elements (i.e., out[:, 0, 50:]) are those of the backward network. The correct comparison for the forward network is then:

torch.equal(out[:, -1, :50], hn_conceptual_view[-1, 0, :, :])
True

If you want the hidden states of the backward network: since the backward network processes the sequence from time step n ... 1, you compare the first timestep of the sequence but the last hidden_size elements, and change the hn_conceptual_view direction to 1:

torch.equal(out[:, 0, 50:], hn_conceptual_view[-1, 1, :, :])
True

In a nutshell, generally speaking:

Unidirectional:

rnn_module = nn.RECURRENT_MODULE(input_size = E, hidden_size = H, num_layers = X, batch_first = True)
inp = torch.rand(B, S, E)
output, hn = rnn_module(inp)
hn_conceptual_view = hn.view(X, 1, B, H)

Where RECURRENT_MODULE is either GRU or LSTM (at the time of writing this post), B is the batch size, S the sequence length, and E the embedding (input) size.

torch.equal(output[:, -1, :], hn_conceptual_view[-1, 0, :, :])
True

Again, we take the last timestep (index -1, i.e., position S - 1) since the rnn_module is forward-only (i.e., unidirectional) and the last timestep is stored at the end of the sequence.
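
Note that nn.LSTM returns its hidden state as a tuple (h_n, c_n) rather than a single tensor, so for an LSTM the same conceptual view is applied to the first element of that tuple. A minimal sketch, with arbitrary illustrative sizes:

import torch
import torch.nn as nn

X, B, S, E, H = 3, 4, 7, 5, 6   # arbitrary sizes for illustration

lstm = nn.LSTM(input_size = E, hidden_size = H, num_layers = X, batch_first = True)
inp = torch.rand(B, S, E)
output, (hn, cn) = lstm(inp)    # LSTM returns (h_n, c_n), not a single hidden tensor

hn_conceptual_view = hn.view(X, 1, B, H)
torch.equal(output[:, -1, :], hn_conceptual_view[-1, 0, :, :])
True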

Bidirectional:

rnn_module = nn.RECURRENT_MODULE(input_size = E, hidden_size = H, num_layers = X, batch_first = True, bidirectional = True)
inp = torch.rand(B, S, E)
output, hn = rnn_module(inp)
hn_conceptual_view = hn.view(X, 2, B, H)

Comparison

torch.equal(output[:, -1, :H], hn_conceptual_view[-1, 0, :, :])
True

Above is the forward network comparison; we use :H because the forward direction stores its hidden vector in the first H elements of each timestep.

For the backward network:

torch.equal(output[:, 0, H:], hn_conceptual_view[-1, 1, :, :])
True

We changed the direction in hn_conceptual_view to 1 to get hidden vectors for the backward network.


For all examples we used hn_conceptual_view[-1, ...] because we are only interested in the last layer.

Upvotes: 4

Novak

Reputation: 4783

There are three things you have to remember to make sense of this in PyTorch. This answer is written on the assumption that you are using something like torch.nn.GRU or the like, and that, if you are making a multi-layer RNN with it, you are using the num_layers argument to do so (rather than building one from scratch out of individual layers yourself).

  1. The output will give you the hidden layer outputs of the network for each time-step, but only for the final layer. This is useful in many applications, particularly encoder-decoders using attention. (These architectures build up a 'context' layer from all the hidden outputs, and it is extremely useful to have them sitting around as a self-contained unit.)

  2. The h_n will give you the hidden layer outputs for the last time-step only, but for all the layers. Therefore, if and only if you have a single layer architecture, h_n is a strict subset of output. Otherwise, output and h_n intersect, but are not strict subsets of one another. (You will often want these, in an encoder-decoder model, from the encoder in order to jumpstart the decoder.)

  3. If you are using a bidirectional output and you want to actually verify that part of h_n is contained in output (and vice-versa), you need to understand what PyTorch does behind the scenes in the organization of the inputs and outputs. Specifically, it concatenates a time-reversed input with the time-forward input and runs them together. This is literal. This means that the 'forward' output at time T is in the final position of the output tensor, sitting right next to the 'reverse' output at time 0; if you're looking for the 'reverse' output at time T, it is in the first position. (A short sketch checking this layout follows below.)
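
A sketch that checks the layout from point 3 for a 1-layer bidirectional GRU (all sizes are arbitrary and purely illustrative):

import torch
import torch.nn as nn

seq_len, batch, input_size, hidden_size = 7, 4, 5, 6   # arbitrary sizes

gru = nn.GRU(input_size, hidden_size, num_layers = 1, bidirectional = True)
x = torch.randn(seq_len, batch, input_size)
out, h_n = gru(x)   # out: (seq_len, batch, 2 * hidden_size), h_n: (2, batch, hidden_size)

# forward direction: last position of out, first half of the feature dimension
torch.equal(out[-1, :, :hidden_size], h_n[0])
True

# reverse direction: first position of out, second half of the feature dimension
torch.equal(out[0, :, hidden_size:], h_n[1])
True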

The third point in particular drove me absolutely bonkers for about three hours the first time I was playing with RNNs and GRUs. In fairness, it is also why h_n is provided as an output: once you figure it out, you don't have to worry about it any more; you just get the right stuff from the return value.

Upvotes: 3

joyzaza

Reputation: 186

It is not the transpose. You can get rnn_output = hidden[-1] when the LSTM has a single layer.

hidden is the output of every cell in every layer; for a specific input time step it would be a 2D array, but the LSTM returns all the time steps, so the output of the (last) layer is hidden[-1].

This discussion assumes the batch size is 1; otherwise the output and hidden tensors each gain an extra dimension.
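
A minimal sketch of what this answer seems to describe, for a single-layer unidirectional GRU (sizes are arbitrary):

import torch
import torch.nn as nn

gru = nn.GRU(input_size = 5, hidden_size = 6, num_layers = 1)
x = torch.randn(9, 3, 5)      # (seq_len = 9, batch = 3, input_size = 5)
rnn_output, hidden = gru(x)   # rnn_output: (9, 3, 6), hidden: (1, 3, 6)

# hidden[-1] is the (only) layer's state at the final timestep,
# i.e. the last row of rnn_output
torch.equal(rnn_output[-1], hidden[-1])
True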

Upvotes: 0
