blue-sky

Reputation: 53816

Intuition for varying number of nodes in deep learning model

Other than just trial and error, what impact does varying the number of nodes in a deep learning model achieve?

How I interpret it is this: each layer's learned representation is a dense vector if the number of nodes is low and, conversely, a sparse vector if the number of nodes is high. How does this lead to higher or lower training accuracy?

Upvotes: 3

Views: 438

Answers (3)

KonstantinosKokos

Reputation: 3453

A neural network can be seen as a function approximation tool. The quality of the approximation is defined by its error, i.e. how far the prediction is from the underlying ground truth. If we leave the practitioner's approach (trial & error) aside, there are two theories through which we can investigate the effect of the number of nodes (aka width) on the network's quality: one is the theory of computation, and the other is algebraic topology. Neither has yet provided results immediately translatable to "if you add another node then this happens", but both have some very nice insights to offer. I am not sure if this is the kind of answer you are expecting, but I will try to very briefly walk you through the major points of what the latter field offers in terms of explanations.

Algebraic Topology / Control Theory

  1. A "shallow" network (i.e. a single dense layer) can approximate any continuous function with arbitrarily low error, under the assumption of no constraints on the number of nodes. What this says is that your network can learn (almost) perfectly whatever you throw at it, no matter how complex it is, provided you let it use as many nodes as it wants (potentially countably infinite). Practically, even though we know that there exists a shallow network that approximates a continuous function f with error ε → 0, we do not know what that network is or how to estimate its parameters. Generally, the more complex f is and the lower we want ε to be, the more nodes we need, up to the point where training becomes infeasible due to the curse of dimensionality. In very applied terms, this means that the wider your layer, the richer your representation and the more accurate your prediction (a toy sketch follows this list). As a side effect you will also have more parameters to train, hence greater data requirements, and measures against overfitting will become necessary.
  2. A high-rank tensor, such as the ones usually used as objective functions for neural networks, can be decomposed into a series of potentially lower-rank tensors. This effectively reduces the degrees of freedom and makes numeric representation easier, with far fewer parameters. However, determining the rank of the decomposition (the number of summands) is NP-hard, as is determining the coefficients themselves. The shallow network corresponds to the canonical decomposition, so, that problem being NP-hard, no claims can be made regarding the number of nodes necessary to construct a perfect approximation. What we do know, however, is that recurrent networks correspond to another sort of decomposition, the Tensor-Train decomposition, which is far more memory-efficient and stable; therefore a shallow network would need exponentially more width to mimic a recurrent network of the same width. Similarly, we know that a convolutional network corresponds to the Hierarchical Tucker decomposition, which is also more efficient than the shallow network; therefore a conv layer can compute in polynomial size what would require super-polynomial size for a shallow layer.
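
To make the "wider layer, richer representation" point of (1) concrete, here is a minimal sketch under my own assumptions (PyTorch, an arbitrary target function, arbitrary hyper-parameters; none of this comes from the cited papers): as the width of a single hidden layer grows, the achievable training error on a fixed continuous function typically shrinks.

```python
# Toy illustration (my own, not from the references): fit a single hidden
# layer of varying width to a fixed continuous function and report the final
# training error. The target sin(2x) and all hyper-parameters are arbitrary.
import torch
import torch.nn as nn

torch.manual_seed(0)

def fit_shallow(width, steps=2000):
    x = torch.linspace(-3, 3, 256).unsqueeze(1)
    y = torch.sin(2 * x)                         # the continuous "ground truth" f
    net = nn.Sequential(nn.Linear(1, width), nn.Tanh(), nn.Linear(width, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(x), y)
        loss.backward()
        opt.step()
    return loss.item()                           # final training MSE

for w in (2, 8, 32, 128):
    print(f"width={w:4d}  final MSE={fit_shallow(w):.5f}")
```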

Refs:

  1. Approximation by Superpositions of a Sigmoidal Function
  2. Convolutional Rectifier Networks as Generalized Tensor Decompositions
  3. Expressive Power of Recurrent Neural Networks

TL;DR: We don't know much about how much width is necessary for a given approximation, but we can compare width-efficiency between different network types. We know a shallow network (fully-connected layer) can approximate anything if we let it grow with no constraints. We also know that an exponential increase in its size is equivalent to a linear increase in the size of a recurrent layer, and that a super-polynomial increase in its size is equivalent to a polynomial increase in the size of a convolutional layer. So if you're adding width, it had better be on an RNN cell :)

The computational theory perspective follows a different route; that is, translating various network types into computation-theoretic machines and inspecting their Turing degree. There are claims about the number of nodes necessary to simulate a Turing machine using shallow nets, and about how various networks relate to one another in terms of size complexity, but I'm not sure this is anywhere close to what you're asking, so I'll skip this part.

I did not go into the comparison between width and depth efficiency either, as this is not something you're asking, but there are many more experimental results on that topic (and many SO answers far better than I could ever write myself).

Upvotes: 2

Zaw Lin

Reputation: 5708

Your question can alternatively be phrased as: "How do the width and depth of a deep learning model affect its final performance?" There is a very good answer at https://stats.stackexchange.com/questions/214360/what-are-the-effects-of-depth-and-width-in-deep-neural-networks. I reproduce some of its points below:

  • Widening consistently improves performance across residual networks of different depth;
  • Increasing both depth and width helps until the number of parameters becomes too high and stronger regularization is needed;
  • There doesn't seem to be a regularization effect from very high depth in residual networks, as wide networks with the same number of parameters as thin ones can learn the same or better representations. Furthermore, wide networks can successfully learn with a 2 or more times larger number of parameters than thin ones, which would require doubling the depth of thin networks, making them infeasibly expensive to train.

As it happens, this issue was brought up while I was studying for a school module, but in a simplified manner for easier analysis. You can see both the assignment's question and the answer at this link (https://drive.google.com/file/d/1ZCGQuekVf6KcNUh_M4_uOT3ihX7g7xg9/view?usp=sharing).

The conclusion that I reached in this assignment (which you can see in more detail on page 7 of eassy.pdf) is that wider networks generally have better capacity but are also more prone to overfitting.

Intuitively, you can picture it this way: a wider layer essentially decomposes the input space into multiple, potentially overlapping output spaces, which you then recombine at the next layer. If there are more nodes in that layer, you have a larger set of potential output spaces, which translates directly into capacity. Depth, in fact, does not directly translate into capacity, contrary to popular belief.
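
As a rough illustration of that capacity/overfitting trade-off, here is a toy sketch of my own (not the linked assignment; the framework, data, and hyper-parameters are arbitrary assumptions): a narrow and a wide one-hidden-layer net are fit to a small noisy dataset, and the wide one usually reaches a lower training error (more capacity) but a larger gap to the noise-free test error (overfitting).

```python
# Toy sketch (assumed setup, not the assignment's): narrow vs. wide
# one-hidden-layer nets on a small noisy dataset. Wider nets tend to fit the
# training noise (lower train MSE) while the train/test gap widens.
import torch
import torch.nn as nn

torch.manual_seed(0)
x_train = torch.rand(50, 1) * 6 - 3                                   # 50 points in [-3, 3]
y_train = torch.sin(2 * x_train) + 0.3 * torch.randn_like(x_train)    # noisy targets
x_test = torch.rand(500, 1) * 6 - 3
y_test = torch.sin(2 * x_test)                                        # noise-free targets

def train_mlp(width, steps=3000):
    net = nn.Sequential(nn.Linear(1, width), nn.Tanh(), nn.Linear(width, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(x_train), y_train)
        loss.backward()
        opt.step()
    with torch.no_grad():
        test_loss = nn.functional.mse_loss(net(x_test), y_test).item()
    return loss.item(), test_loss

for w in (4, 256):
    tr, te = train_mlp(w)
    print(f"width={w:4d}  train MSE={tr:.4f}  test MSE={te:.4f}")
```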

As a last note, your interpretation is not necessarily correct if by sparse you mean many zero values in the vector. If a ReLU activation is used, then it is true that there will be many zeros in the vector. However, in general, the number of close-to-zero entries in the representation vector is not correlated with the number of nodes.
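
A quick way to see this (my own sketch; sizes and initialization are arbitrary choices): push random inputs through ReLU layers of increasing width and look at the fraction of exactly-zero activations. At random initialization it hovers around one half regardless of width, i.e. the representation does not become relatively sparser just because there are more nodes.

```python
# Toy check (assumed setup): the fraction of zero activations after a ReLU
# layer stays roughly constant (~0.5 at random init) as the width grows.
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(1024, 64)                        # a batch of random inputs
for width in (16, 64, 256, 1024):
    layer = nn.Sequential(nn.Linear(64, width), nn.ReLU())
    with torch.no_grad():
        h = layer(x)
    frac_zero = (h == 0).float().mean().item()
    print(f"width={width:5d}  fraction of zeros={frac_zero:.2f}")
```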

Upvotes: 1

yo_man

Reputation: 412

As far as I know, why over-parameterised networks work well with SGD-like optimization techniques is still not well understood. We know that deep networks generalize better over new test data. Increasing the number of units increases the capacity of the model to learn with more and more training data. Sure, there will be a lot of redundancy among the nodes, and you can end up with sparse models if you use appropriate regularisation on the weights. For example, a network with 1000-1000-1000 (3 dense layers with 1000 units each) might give you an accuracy of 90% with 100k training samples. It might happen that you come across another 500k training samples and the accuracy is still 90%. The model has possibly reached a saturation point, and you would need to increase the units per layer or modify the model architecture.

Upvotes: 0
