Reputation: 3393
I am going through a series of machine learning examples that use RNNs for document classification (many-to-one). In most tutorials, the RNN output of the last time step is used, i.e., it is fed into one or more dense layers to map it to the number of classes (e.g., [1], [2]).
However, I have also come across examples where, instead of the last output, the average of the outputs over all time steps is used (mean pooling?, e.g., [3]). The dimensions of this averaged output are of course the same as those of the last output, so computationally both approaches work the same way.
My question now is: what is the intuition behind the two approaches? Due to its recurrent nature, the last output already reflects the outputs of the previous time steps. So why average the RNN outputs over all time steps? When should one use which?
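For concreteness, here is a minimal PyTorch sketch (the GRU, the layer sizes, and the five classes are assumptions for illustration, not taken from the linked tutorials) showing that both variants produce tensors of the same shape and can feed the same dense layer:

    import torch
    import torch.nn as nn

    batch, seq_len, emb_dim, hidden = 4, 50, 100, 128
    x = torch.randn(batch, seq_len, emb_dim)      # already-embedded document batch

    rnn = nn.GRU(emb_dim, hidden, batch_first=True)
    outputs, _ = rnn(x)                           # (batch, seq_len, hidden)

    last_step = outputs[:, -1, :]                 # output of the last time step, (batch, hidden)
    mean_pool = outputs.mean(dim=1)               # average over all time steps,  (batch, hidden)

    classifier = nn.Linear(hidden, 5)             # hypothetical 5-class problem
    print(classifier(last_step).shape)            # torch.Size([4, 5])
    print(classifier(mean_pool).shape)            # torch.Size([4, 5])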
Upvotes: 2
Views: 1653
Reputation: 53758
Pooling over time is a specific technique used to extract features from the input sequence. From this question:
The reason to do this, instead of "down-sampling" the sentence like in a CNN, is that in NLP the sentences naturally have different lengths in a corpus. This makes the feature maps different for different sentences, but we'd like to reduce the tensor to a fixed size to apply a softmax or regression head in the end. As stated in the paper, it allows capturing the most important feature, the one with the highest value for each feature map.
It's important to note here that max-over-time (or average-over-time) is usually an intermediate layer. In particular, there can be several of them in a row or in parallel (with different window sizes). The end result produced by the network can still be either many-to-one or many-to-many (at least in theory).
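For example, here is a rough sketch (the kernel sizes and filter count are my own choices, loosely in the spirit of the CNN setup the quote refers to) of max-over-time pooling used as an intermediate step, with several window sizes in parallel producing a fixed-size vector for sentences of different lengths:

    import torch
    import torch.nn as nn

    emb_dim, n_filters = 100, 64
    convs = nn.ModuleList(
        [nn.Conv1d(emb_dim, n_filters, kernel_size=k) for k in (3, 4, 5)]
    )

    def encode(x):
        # x: (batch, seq_len, emb_dim); seq_len may differ between batches
        x = x.transpose(1, 2)                                   # (batch, emb_dim, seq_len)
        pooled = [conv(x).max(dim=2).values for conv in convs]  # max over time: (batch, n_filters)
        return torch.cat(pooled, dim=1)                         # fixed size: (batch, 3 * n_filters)

    print(encode(torch.randn(2, 20, emb_dim)).shape)            # short sentences -> torch.Size([2, 192])
    print(encode(torch.randn(2, 80, emb_dim)).shape)            # long sentences  -> torch.Size([2, 192])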
However, in most cases, there is a single output vector from the RNN. If the output must be a sequence, this vector is usually fed into another RNN. So it all boils down to how exactly this single vector is obtained: take the last cell output, aggregate across the whole sequence, apply an attention mechanism, etc.
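As a rough illustration (my own sketch with an assumed hidden size, not a definitive recipe), those options look like this in PyTorch:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    hidden = 128
    outputs = torch.randn(4, 50, hidden)            # assumed RNN outputs: (batch, seq_len, hidden)

    last = outputs[:, -1, :]                        # 1. last cell output
    mean = outputs.mean(dim=1)                      # 2. aggregate across the whole sequence

    score = nn.Linear(hidden, 1)                    # 3. simple additive attention
    weights = F.softmax(score(outputs), dim=1)      #    one weight per time step, (batch, seq_len, 1)
    attended = (weights * outputs).sum(dim=1)       #    weighted sum over time,   (batch, hidden)

    print(last.shape, mean.shape, attended.shape)   # all torch.Size([4, 128])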
Upvotes: 1