Reputation: 53816
Other than just trial and error, what effect does varying the number of nodes in a deep learning model have?
How I interpret it is this: each learned representation of a layer is a dense vector if the number of nodes is low, and conversely each representation is a sparse vector if the number of nodes is high. How does this contribute to higher or lower training accuracy?
Upvotes: 3
Views: 438
Reputation: 3453
A neural network can be seen as a function approximation tool. The quality of the approximation is defined by its error, i.e. how far the prediction is from the underlying ground truth. If we leave the practitioner approach (trial & error) aside, there are two theories through which we can investigate the effect of the number of nodes (aka width) on the network's quality: one is the theory of computation, and the other is algebraic topology. Neither has yet provided results immediately translatable to "if you add another node then this happens", but both have some very nice insights to offer. I am not sure if this is the kind of answer you are expecting, but I will try to very briefly walk you through the major points of what the latter field offers in terms of explanations.
Algebraic Topology / Control Theory
TL;DR: We don't know much about how much width is necessary for some approximation, but we can compare width-efficiency between different network types. We know a shallow network (fully-connected layer) can approximate anything if we let it grow with no constraints. We also know that an exponential increase in its size is equivalent to a linear increase in a recurrent layer size, and that a super-polynomial increase in its size is equivalent to a polynomial increase in a convolutional layer. So if you're adding width, it better be on an RNN cell :)
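To make the "a shallow network can approximate anything if we let it grow" point concrete, here is a minimal sketch (my own illustration, not from the sources above, assuming PyTorch is available) that fits sin(x) with a single hidden layer of increasing width and prints the final training error:

```python
# Minimal width-vs-approximation-error sketch: one hidden layer, increasing width.
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(-3, 3, 256).unsqueeze(1)   # 256 training points in [-3, 3]
y = torch.sin(x)                              # target function to approximate

for width in [4, 16, 64, 256]:
    model = nn.Sequential(nn.Linear(1, width), nn.Tanh(), nn.Linear(width, 1))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(2000):                     # plain full-batch training
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
    print(f"width={width:4d}  final MSE={loss.item():.5f}")
```

The error should typically drop as the width grows, which is the width–approximation-quality relationship in its simplest possible form.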
The computational theory perspective follows a different route; that is, translating various network types to computation-theoretic machines and inspecting their Turing degree. There are claims about the number of nodes necessary to simulate a Turing machine using shallow nets, and about how various networks relate to one another in terms of size complexity, but I'm not sure this is anywhere close to what you're asking, so I'll skip this part.
I did not go into the comparison between width and depth efficiency either, as this is not something you're asking, but there are many more experimental results on that topic (and many SO answers far better than I could ever write myself).
Upvotes: 2
Reputation: 5708
Your question can be phrased alternatively as "How do width and depth of deep learning models affect the final performance?" There is a very good answer at https://stats.stackexchange.com/questions/214360/what-are-the-effects-of-depth-and-width-in-deep-neural-networks. I reproduce some of its points below; a rough parameter-count sketch follows the list:
- Widening consistently improves performance across residual networks of different depth;
- Increasing both depth and width helps until the number of parameters becomes too high and stronger regularization is needed;
- There doesn’t seem to be a regularization effect from very high depth in residual networks, as wide networks with the same number of parameters as thin ones can learn the same or better representations. Furthermore, wide networks can successfully learn with a 2 or more times larger number of parameters than thin ones, which would require doubling the depth of thin networks, making them infeasibly expensive to train.
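As a rough back-of-the-envelope complement to the points above (my own sketch, not from the linked answer or the paper), here is how the parameter count of a plain fully-connected stack scales with width versus depth: doubling the width of every layer roughly quadruples the parameters, while doubling the depth only roughly doubles them.

```python
# Compare parameter counts of a "thin & deep" vs a "wide & shallow" MLP.
import torch.nn as nn

def mlp(width, depth, d_in=128, d_out=10):
    layers = [nn.Linear(d_in, width), nn.ReLU()]
    for _ in range(depth - 1):
        layers += [nn.Linear(width, width), nn.ReLU()]
    layers.append(nn.Linear(width, d_out))
    return nn.Sequential(*layers)

def n_params(model):
    return sum(p.numel() for p in model.parameters())

print("thin & deep   :", n_params(mlp(width=128, depth=8)))  # 8 hidden layers of 128
print("wide & shallow:", n_params(mlp(width=512, depth=2)))  # 2 hidden layers of 512
```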
As it happens, while I was studying for a school module this issue was brought up, but in a simplified manner for easier analysis. You can see both the assignment's question and answer at this link (https://drive.google.com/file/d/1ZCGQuekVf6KcNUh_M4_uOT3ihX7g7xg9/view?usp=sharing).
The conclusion that I reached in this assignment (which you can see in more detail on page 7 of eassy.pdf) is that wider networks generally have better capacity but are also more prone to overfitting.
Intuitively, you can imagine it this way. A wider layer essentially means that you are decomposing the input space into more, potentially overlapping, output spaces, which you then recombine at the next layer. If there are more nodes in that layer, you have a larger set of potential output spaces, which directly translates into capacity. Depth, in fact, does not directly translate into capacity, contrary to popular belief.
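Here is a toy numpy sketch of that intuition (my own illustration, not part of the assignment): a single random ReLU layer carves 1-D inputs into pieces, one per distinct on/off pattern of its units, and the number of pieces grows with the width.

```python
# Count distinct ReLU on/off patterns produced by one random hidden layer on 1-D inputs.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 10_000).reshape(-1, 1)      # dense grid of 1-D inputs

for width in [2, 8, 32, 128]:
    W = rng.normal(size=(1, width))                # random hidden weights
    b = rng.normal(size=(width,))                  # random biases
    pattern = (x @ W + b) > 0                      # which units are "on" at each x
    n_regions = len(np.unique(pattern, axis=0))    # distinct on/off patterns
    print(f"width={width:4d}  distinct activation patterns={n_regions}")
```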
As a last note, your interpretation is not necessarily correct if by sparse you mean many zero values in the vector. If a ReLU activation is used, then it is true that there are many zeros in the vector. However, in general the proportion of close-to-zero values in the representation vector is not correlated with the number of nodes.
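A quick numpy check of that last point (again my own sketch, under the simplifying assumption of random weights and zero biases): with ReLU the fraction of exact zeros in the hidden representation stays near 50% whether the layer has 16 or 1024 units, so a wider layer is not automatically a sparser one.

```python
# Fraction of zero activations under ReLU as the hidden layer gets wider.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 64))                    # 1000 random 64-D inputs

for width in [16, 64, 256, 1024]:
    W = rng.normal(size=(64, width)) / np.sqrt(64) # random hidden weights
    h = np.maximum(x @ W, 0.0)                     # ReLU activations
    print(f"width={width:5d}  fraction of zeros={np.mean(h == 0):.3f}")
```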
Upvotes: 1
Reputation: 412
As far as I know, why over-parameterised networks work well with SGD-like optimization techniques is still not well understood. We know that deep networks generalize better to new test data. Increasing the number of units increases the capacity of the model to learn from more and more training data. Sure, there will be a lot of redundancy amongst the nodes, and you can end up with sparse models if you use appropriate regularisation on the weights. For example, a network with 1000-1000-1000 (3 dense layers with 1000 units each) might give you an accuracy of 90% with 100k training samples. It might happen that you come across another 500k training samples and the accuracy is still 90%. The model has possibly reached a saturation point, and you would need to increase the units per layer or modify the model architecture.
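For concreteness, a hedged PyTorch sketch of the kind of 1000-1000-1000 network described above, with an L1 penalty on the weights as one possible way to get the sparsity mentioned (the input/output sizes and hyperparameters here are illustrative assumptions, not tuned values):

```python
# Three dense layers of 1000 units each, trained with an L1 weight penalty.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 1000), nn.ReLU(),   # assuming e.g. flattened 28x28 inputs
    nn.Linear(1000, 1000), nn.ReLU(),
    nn.Linear(1000, 1000), nn.ReLU(),
    nn.Linear(1000, 10),               # assuming a 10-class problem
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
l1_strength = 1e-5                     # illustrative regularisation strength

def training_step(x, y):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss = loss + l1_strength * sum(p.abs().sum() for p in model.parameters())
    loss.backward()
    opt.step()
    return loss.item()

# usage with a dummy batch:
x = torch.randn(32, 784)
y = torch.randint(0, 10, (32,))
print(training_step(x, y))
```

The L1 term pushes many weights towards zero during training, which is what produces the redundancy/sparsity referred to above.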
Upvotes: 0