CuCaRot

Reputation: 1308

Why does scaling down the parameters many times during training help keep the learning speed the same for all weights in Progressive GAN?

Equalized learning rate is one of the special techniques in Progressive GAN, a paper from the NVIDIA team. By using this method, they state that

Our approach ensures that the dynamic range, and thus the learning speed, is the same for all weights.

In detail, they initialize all learnable parameters from a normal distribution N(0,1). During training, at every forward pass, they scale the result by the per-layer normalization constant from He's initializer.

I reproduced the code from the pytorch GAN zoo GitHub repo:

import math
from numpy import prod

def forward(self, x, equalized):
    # compute the He constant depending on the size of the weight tensor W
    size = self.module.weight.size()
    fan_in = prod(size[1:])
    weight = math.sqrt(2.0 / fan_in)
    '''
    A module example:

    import torch.nn as nn
    module = nn.Conv2d(nChannelsPrevious, nChannels, kernelSize, padding=padding, bias=bias)
    '''
    x = self.module(x)

    # scale the layer output by the He constant at every forward pass
    if equalized:
        x *= weight
    return x
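
For context, here is a minimal, self-contained sketch of how such a wrapper could be used. The class name EqualizedConv2d and its constructor arguments are my own illustration, not the exact pytorch GAN zoo API:

import math
import torch
import torch.nn as nn
from numpy import prod

class EqualizedConv2d(nn.Module):
    """Hypothetical wrapper: trivial N(0,1) init plus runtime scaling by the He constant."""
    def __init__(self, in_channels, out_channels, kernel_size, padding=0, equalized=True):
        super().__init__()
        self.module = nn.Conv2d(in_channels, out_channels, kernel_size, padding=padding)
        # trivial N(0, 1) initialization instead of He initialization
        nn.init.normal_(self.module.weight, mean=0.0, std=1.0)
        nn.init.zeros_(self.module.bias)
        self.equalized = equalized

    def forward(self, x):
        size = self.module.weight.size()
        fan_in = prod(size[1:])              # in_channels * kernel_height * kernel_width
        c = math.sqrt(2.0 / fan_in)          # He constant for this layer
        x = self.module(x)
        if self.equalized:
            x = x * c                        # rescale the output at every forward pass
        return x

# quick smoke test
layer = EqualizedConv2d(3, 16, kernel_size=3, padding=1)
out = layer(torch.randn(1, 3, 8, 8))
print(out.shape)  # torch.Size([1, 16, 8, 8])

The point is that the stored weights stay N(0,1); only the output is rescaled at runtime.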

At first, I thought the He constant would be as in He's paper:

w_l \sim \mathcal{N}\left(0, \sqrt{\frac{2}{n_l}}\right), \text{ i.e. } c = \sqrt{\frac{2}{n_l}}, \text{ where } n_l \text{ is the fan-in of layer } l

Normally n_l > 2, so c = \sqrt{2/n_l} < 1 and dividing by c scales w_l up, which increases the gradient during backpropagation according to the formula in the ProGAN paper, \hat{w}_i=\frac{w_i}{c}, and thus prevents vanishing gradients.

However, the code shows that \hat{w}_i = w_i \cdot c.
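
To make the two conventions concrete, here is the arithmetic for a hypothetical fan-in of n_l = 8 (a value picked only for illustration):

c = \sqrt{2/n_l} = \sqrt{2/8} = 0.5

w_i / c = 2\,w_i \quad (\text{scale up}), \qquad w_i \cdot c = 0.5\,w_i \quad (\text{scale down})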

In summary, I can't understand why scaling down the parameters many times during training helps keep the learning speed stable.

I have asked this question in some communities, e.g. Artificial Intelligence and Mathematics, and still haven't received an answer.

Please help me explain it, thank you!

Upvotes: 3

Views: 573

Answers (2)

CuCaRot

Reputation: 1308

In deep learning, each layer's input is the previous layer's output. Normally, its statistical distribution changes after a few iterations.

For instance, consider a fully connected layer with a weight shape of [2, 1], where each value is initialized from a normal distribution N(0, 1). This layer produces an output with a variance of 2 (Ref). Because the distribution keeps fluctuating, the more inferences are made, the more dynamic the output of that node becomes. This growing dynamic range makes it hard for the model to adapt, a phenomenon known as internal covariate shift.
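
A minimal sketch (my own construction, drawing a fresh N(0,1) weight for every sample) that checks the variance-of-2 claim empirically:

import torch

torch.manual_seed(0)
n = 1_000_000
x = torch.randn(n, 2)          # inputs ~ N(0, 1)
w = torch.randn(n, 2)          # a fresh N(0, 1) weight draw per sample
y = (x * w).sum(dim=1)         # output of the [2, 1] fully connected layer, no bias
print(y.var().item())          # ≈ 2.0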

Equalized learning rate stabilizes this process by normalizing each layer by its shift.

Upvotes: 0

Marzi Heidari

Reputation: 2730

The paper already gives an explanation for why the parameters are scaled in every single pass:

The benefit of doing this dynamically instead of during initialization is somewhat subtle and relates to the scale-invariance in commonly used adaptive stochastic gradient descent methods such as RMSProp (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2015). These methods normalize a gradient update by its estimated standard deviation, thus making the update independent of the scale of the parameter. As a result, if some parameters have a larger dynamic range than others, they will take longer to adjust. This is a scenario modern initializers cause, and thus it is possible that a learning rate is both too large and too small at the same time.

I believe multiplying by He's constant in every pass ensures that the range of the parameters will not be too wide at any point during backpropagation, and therefore they will not take so long to adjust. So if, for example, the discriminator at some point in the learning process adjusts faster than the generator, it will not take the generator long to adjust itself, and consequently their learning processes will equalize.
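
To illustrate the scale-invariance the quoted paragraph refers to, here is a toy sketch of my own showing that Adam's step size is roughly independent of the gradient's scale, because the update is normalized by its estimated standard deviation:

import torch

# two parameters whose (constant) gradients differ in scale by 100x
small = torch.zeros(1, requires_grad=True)
large = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([small, large], lr=0.01)

for _ in range(100):
    opt.zero_grad()
    # d(loss)/d(small) = 1.0, d(loss)/d(large) = 100.0
    loss = (1.0 * small + 100.0 * large).sum()
    loss.backward()
    opt.step()

# despite the 100x gradient difference, both moved by roughly the same amount
print(small.item(), large.item())   # both ≈ -1.0 (100 steps of size ~lr)

Since the step size is roughly the same for every weight, a weight with a much larger dynamic range would need many more steps to change proportionally, which is exactly the problem the quoted paragraph describes.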

Upvotes: 1
