Reputation: 5291
I do understand conceptually what an LSTM or GRU should (thanks to this question What's the difference between "hidden" and "output" in PyTorch LSTM?) BUT when I inspect the output of the GRU h_n
and output
are NOT the same while they should be...
(Pdb) rnn_output
tensor([[[ 0.2663, 0.3429, -0.0415, ..., 0.1275, 0.0719, 0.1011],
[-0.1272, 0.3096, -0.0403, ..., 0.0589, -0.0556, -0.3039],
[ 0.1064, 0.2810, -0.1858, ..., 0.3308, 0.1150, -0.3348],
...,
[-0.0929, 0.2826, -0.0554, ..., 0.0176, -0.1552, -0.0427],
[-0.0849, 0.3395, -0.0477, ..., 0.0172, -0.1429, 0.0153],
[-0.0212, 0.1257, -0.2670, ..., -0.0432, 0.2122, -0.1797]]],
grad_fn=<StackBackward>)
(Pdb) hidden
tensor([[[ 0.1700, 0.2388, -0.4159, ..., -0.1949, 0.0692, -0.0630],
[ 0.1304, 0.0426, -0.2874, ..., 0.0882, 0.1394, -0.1899],
[-0.0071, 0.1512, -0.1558, ..., -0.1578, 0.1990, -0.2468],
...,
[ 0.0856, 0.0962, -0.0985, ..., 0.0081, 0.0906, -0.1234],
[ 0.1773, 0.2808, -0.0300, ..., -0.0415, -0.0650, -0.0010],
[ 0.2207, 0.3573, -0.2493, ..., -0.2371, 0.1349, -0.2982]],
[[ 0.2663, 0.3429, -0.0415, ..., 0.1275, 0.0719, 0.1011],
[-0.1272, 0.3096, -0.0403, ..., 0.0589, -0.0556, -0.3039],
[ 0.1064, 0.2810, -0.1858, ..., 0.3308, 0.1150, -0.3348],
...,
[-0.0929, 0.2826, -0.0554, ..., 0.0176, -0.1552, -0.0427],
[-0.0849, 0.3395, -0.0477, ..., 0.0172, -0.1429, 0.0153],
[-0.0212, 0.1257, -0.2670, ..., -0.0432, 0.2122, -0.1797]]],
grad_fn=<StackBackward>)
they are some transpose of each other...why?
Upvotes: 1
Views: 7629
Reputation: 6115
They are not really the same. Consider that we have the following Unidirectional GRU model:
import torch.nn as nn
import torch
gru = nn.GRU(input_size = 8, hidden_size = 50, num_layers = 3, batch_first = True)
Please make sure you observe the input shape carefully.
inp = torch.randn(1024, 112, 8)
out, hn = gru(inp)
Definitely,
torch.equal(out, hn)
False
One of the most efficient ways that helped me to understand the output vs. hidden states was to view the hn
as hn.view(num_layers, num_directions, batch, hidden_size)
where num_directions = 2
for bidirectional recurrent networks (and 1 other wise, i.e., our case). Thus,
hn_conceptual_view = hn.view(3, 1, 1024, 50)
As the doc states (Note the italics and bolds):
h_n of shape (num_layers * num_directions, batch, hidden_size): tensor containing the hidden state for t = seq_len (i.e., for the last timestep)
In our case, this contains the hidden vector for the timestep t = 112
, where the:
output of shape (seq_len, batch, num_directions * hidden_size): tensor containing the output features h_t from the last layer of the GRU, for each t. If a torch.nn.utils.rnn.PackedSequence has been given as the input, the output will also be a packed sequence. For the unpacked case, the directions can be separated using output.view(seq_len, batch, num_directions, hidden_size), with forward and backward being direction 0 and 1 respectively.
So, consequently, one can do:
torch.equal(out[:, -1], hn_conceptual_view[-1, 0, :, :])
True
Explanation: I compare the last sequence from all batches in out[:, -1]
to the last layer hidden vectors from hn[-1, 0, :, :]
For Bidirectional GRU (requires reading the unidirectional first):
gru = nn.GRU(input_size = 8, hidden_size = 50, num_layers = 3, batch_first = True bidirectional = True)
inp = torch.randn(1024, 112, 8)
out, hn = gru(inp)
View is changed to (since we have two directions):
hn_conceptual_view = hn.view(3, 2, 1024, 50)
If you try the exact code:
torch.equal(out[:, -1], hn_conceptual_view[-1, 0, :, :])
False
Explanation: This is because we are even comparing wrong shapes;
out[:, 0].shape
torch.Size([1024, 100])
hn_conceptual_view[-1, 0, :, :].shape
torch.Size([1024, 50])
Remember that for bidirectional networks, hidden states get concatenated at each time step where the first hidden_state
size (i.e., out[:, 0,
:50
]
) are the the hidden states for the forward network, and the other hidden_state
size are for the backward (i.e., out[:, 0,
50:
]
). The correct comparison for the forward network is then:
torch.equal(out[:, -1, :50], hn_conceptual_view[-1, 0, :, :])
True
If you want the hidden states for the backward network, and since a backward network processes the sequence from time step n ... 1
. You compare the first timestep of the sequence but the last hidden_state
size and changing the hn_conceptual_view
direction to 1
:
torch.equal(out[:, -1, :50], hn_conceptual_view[-1, 1, :, :])
True
In a nutshell, generally speaking:
Unidirectional:
rnn_module = nn.RECURRENT_MODULE(num_layers = X, hidden_state = H, batch_first = True)
inp = torch.rand(B, S, E)
output, hn = rnn_module(inp)
hn_conceptual_view = hn.view(X, 1, B, H)
Where RECURRENT_MODULE
is either GRU or LSTM (at the time of writing this post), B
is the batch size, S
sequence length, and E
embedding size.
torch.equal(output[:, S, :], hn_conceptual_view[-1, 0, :, :])
True
Again we used S
since the rnn_module
is forward (i.e., unidirectional) and the last timestep is stored at the sequence length S
.
Bidirectional:
rnn_module = nn.RECURRENT_MODULE(num_layers = X, hidden_state = H, batch_first = True, bidirectional = True)
inp = torch.rand(B, S, E)
output, hn = rnn_module(inp)
hn_conceptual_view = hn.view(X, 2, B, H)
Comparison
torch.equal(output[:, S, :H], hn_conceptual_view[-1, 0, :, :])
True
Above is the forward network comparison, we used :H
because the forward stores its hidden vector in the first H
elements for each timestep.
For the backward network:
torch.equal(output[:, 0, H:], hn_conceptual_view[-1, 1, :, :])
True
We changed the direction in hn_conceptual_view
to 1
to get hidden vectors for the backward network.
For all examples we used hn_conceptual_view[-1, ...]
because we are only interested in the last layer.
Upvotes: 4
Reputation: 4783
There are three things you have to remember to make sense of this in PyTorch.
This answer is written on the assumption that you are using something like torch.nn.GRU or the like, and that if you are making a multi-layer RNN with it, that you are using the num_layers
argument to do so (rather than building one from scratch out of individual layers yourself.)
The output
will give you the hidden layer outputs of the network for each time-step, but only for the final layer. This is useful in many applications, particularly encoder-decoders using attention. (These architectures build up a 'context' layer from all the hidden outputs, and it is extremely useful to have them sitting around as a self-contained unit.)
The h_n
will give you the hidden layer outputs for the last time-step only, but for all the layers. Therefore, if and only if you have a single layer architecture, h_n
is a strict subset of output
. Otherwise, output
and h_n
intersect, but are not strict subsets of one another. (You will often want these, in an encoder-decoder model, from the encoder in order to jumpstart the decoder.)
If you are using a bidirectional output and you want to actually verify that part of h_n
is contained in output
(and vice-versa) you need to understand what PyTorch does behind the scenes in the organization of the inputs and outputs. Specifically, it concatenates a time-reversed input with the time-forward input and runs them together. This is literal. This means that the 'forward' output at time T is in the final position of the output
tensor sitting right next to the 'reverse' output at time 0; if you're looking for the 'reverse' output at time T, it is in the first position.
The third point in particular drove me absolute bonkers for about three hours the first time I was playing RNNs and GRUs. In fairness, it is also why h_n
is provided as an output, so once you figure it out, you don't have to worry about it any more, you just get the right stuff from the return value.
Upvotes: 3
Reputation: 186
Is Not the transpose , you can get rnn_output = hidden[-1] when the layer of lstm is 1
hidden is a output of every cell every layer, it's shound be a 2D array for a specifc input time step , but lstm return all the time step , so the output of a layer should be hidden[-1]
and this situation discussed when batch is 1 , or the dimention of output and hidden need to add one
Upvotes: 0