Reputation: 429
I have two transformer networks: one with 3 attention heads per layer and 15 layers in total, and a second one with 5 heads per layer and 30 layers in total. Given an arbitrary set of documents (2048 tokens each), how can I find out which network is going to be better to use and is less prone to overfitting?
In computer vision we have a concept called the "receptive field", which helps us understand how big or small a network we need to use. For instance, if we have a CNN with 120 layers and a CNN with 70 layers, we can calculate their receptive fields and understand which one is going to perform better on a particular dataset of images.
Do you guys have something similar in NLP? How do you understand whether one architecture is more optimal to use versus another, having a set of text documents with unique properties?
Upvotes: 1
Views: 135
Reputation: 304
How do you understand whether one architecture is more optimal to use versus another, having a set of text documents with unique properties?
For modern Transformer-based language models (LMs), there are some empirical "scaling laws", such as the Chinchilla scaling laws (Wikipedia), which describe how the loss falls predictably as the parameter count and the amount of training data grow together. In practice, this means that larger models (e.g., deeper ones with more layers and hence more parameters) tend to perform better, provided they are trained on enough tokens. So far, most LMs seem to roughly follow Chinchilla scaling. There is another kind of scaling, closer to a "receptive field", that I talk about below.
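To make the scaling-law view concrete for the two networks in the question, here is a rough sketch. It relies on two common rules of thumb rather than exact results: a non-embedding parameter count of about 12 · n_layers · d_model², and the often-quoted Chinchilla heuristic of roughly 20 training tokens per parameter. The question does not state the model width, so `D_MODEL = 480` is a purely hypothetical value (chosen only because it is divisible by both 3 and 5 heads), and the function names are my own.

```python
def approx_transformer_params(n_layers: int, d_model: int) -> int:
    """Rough non-embedding parameter count of a standard Transformer.

    Per layer: ~4*d^2 for the attention projections (Q, K, V, output)
    plus ~8*d^2 for a feed-forward block with a 4x hidden width.
    The number of heads does not appear, because heads only split d_model.
    """
    return 12 * n_layers * d_model ** 2


def chinchilla_tokens(n_params: int, tokens_per_param: int = 20) -> int:
    """Rule-of-thumb number of training tokens for a compute-optimal model."""
    return tokens_per_param * n_params


D_MODEL = 480  # ASSUMPTION: hypothetical width, not given in the question

for name, n_layers in [("15 layers, 3 heads", 15), ("30 layers, 5 heads", 30)]:
    params = approx_transformer_params(n_layers, D_MODEL)
    tokens = chinchilla_tokens(params)
    print(f"{name}: ~{params:,} params, ~{tokens:,} training tokens suggested")
```

Under these assumptions, the deeper network has about twice the parameters and would want about twice the training tokens. If your document set is much smaller than that, the larger network is the one more likely to overfit. Note that with a fixed width, changing the number of heads does not change the estimate at all.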
Do you guys have something similar in NLP?
Kind of. Transformer-based LMs can be thought of as having a "receptive field" similar to CNN layers: the attention mechanism in the Transformer operates on a pre-defined "context window" or "context length", which is the maximum number of tokens a layer can look at ("attend to") at any given time, similar to a CNN kernel. However, with the introduction of new positional encoding (PE) approaches, such as Rotary Position Embedding (RoPE), and modified attention architectures, such as Sliding Window Attention (SWA), this analogy is not strictly accurate.
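As a rough illustration of where the analogy holds and where it breaks: with full attention, a single layer can already see the whole context window, so depth does not enlarge the "receptive field". With sliding-window attention, each layer only sees a local window, and stacking layers extends the reach by roughly one window per layer, much like stacking convolutions. The sketch below computes that upper bound. The window size of 64 is a hypothetical value, and the function names are mine.

```python
def full_attention_span(context_length: int) -> int:
    """With full attention, every layer can attend to the entire context."""
    return context_length


def sliding_window_span(n_layers: int, window: int, context_length: int) -> int:
    """Upper bound on how far information can travel through stacked
    sliding-window attention layers: about one window per layer,
    capped at the context length."""
    return min(n_layers * window, context_length)


CONTEXT = 2048  # document length from the question
WINDOW = 64     # ASSUMPTION: hypothetical sliding-window size

for n_layers in (15, 30):
    print(
        f"{n_layers} layers: full attention reaches {full_attention_span(CONTEXT)} tokens, "
        f"SWA reaches at most {sliding_window_span(n_layers, WINDOW, CONTEXT)} tokens"
    )
```

This is only a bound on what a token could in principle reach, not a measure of what a trained model actually uses.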
Scaling in terms of "context length" is of much interest, but it is usually very difficult to scale Transformers this way, because attention is an $\mathcal{O}(N^2)$ operation in the sequence length $N$. So researchers usually go towards deeper architectures with more parameters ("over-parameterization"), which allow the model to "memorize" as much of the large training corpus as it can ("overfitting"), so that it can perform reasonably well when fine-tuned for most downstream tasks (those that have at least some representative examples in the training corpus).
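To give a feel for the quadratic cost, here is a small sketch that counts the entries of the attention score matrices for the two networks in the question. It assumes plain full attention, where every head in every layer builds an N × N score matrix, and it ignores optimizations that avoid materializing those matrices. The function name is my own.

```python
def attention_score_entries(seq_len: int, n_heads: int, n_layers: int) -> int:
    """Total number of attention scores computed in one forward pass:
    one seq_len x seq_len matrix per head, per layer."""
    return n_layers * n_heads * seq_len ** 2


N = 2048  # tokens per document, from the question

small = attention_score_entries(N, n_heads=3, n_layers=15)
large = attention_score_entries(N, n_heads=5, n_layers=30)
print(f"15 layers x 3 heads: {small:,} scores")   # 188,743,680
print(f"30 layers x 5 heads: {large:,} scores")   # 629,145,600

# Doubling the context length quadruples the count, regardless of depth.
ratio = attention_score_entries(2 * N, 3, 15) / small
print(f"2x context -> {ratio:.0f}x scores")        # 4x
```

Adding layers or heads grows this count linearly, while growing the context grows it quadratically, which is one reason depth is usually the cheaper direction to scale.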
Upvotes: 1