milkyway42

Reputation: 144

Gradient penalty in WGAN-GP

In the Improved Training of Wasserstein GANs paper, Corollary 1 says that the optimal critic f* has gradient norm 1 almost everywhere under Pr and Pg, and the authors add a gradient penalty to the loss function that constrains the gradient norm to be close to 1. I get that this is an alternative to weight clipping and relies on the 1-Lipschitz constraint.

But I don't get why we force the gradient norm to be close to 1. If our generator performs well, we might need a gradient norm smaller than 1 to detect fine differences between real and generated data. Moreover, the 1-Lipschitz condition only requires the gradient norm to be less than or equal to 1, not exactly equal to 1. Especially when $\lambda$ is large, a gradient norm below 1 can have a big impact on the loss, forcing the gradient to become larger even when the current discriminator is actually performing well.
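For concreteness, here is a minimal sketch of the gradient penalty term being discussed, written in PyTorch (an assumption; the paper is framework-agnostic). The two-sided penalty $\lambda(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2$ is evaluated at random interpolates between real and generated samples, and the function name `gradient_penalty` and the critic signature are illustrative, not from the paper:

```python
import torch

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    """Two-sided WGAN-GP penalty: lambda * (||grad D(x_hat)||_2 - 1)^2."""
    batch_size = real.size(0)

    # x_hat: random interpolation between real and generated samples
    eps = torch.rand(batch_size, 1).expand_as(real)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)

    # Critic output at the interpolated points
    d_hat = critic(x_hat)

    # Gradient of the critic output with respect to x_hat
    grads = torch.autograd.grad(
        outputs=d_hat, inputs=x_hat,
        grad_outputs=torch.ones_like(d_hat),
        create_graph=True, retain_graph=True)[0]

    # Penalize deviation of the per-sample gradient norm from 1
    grad_norm = grads.view(batch_size, -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()
```

Note that the penalty is two-sided: a norm of 0.5 is penalized just as a norm of 1.5 is, which is exactly what the question objects to, since the Lipschitz constraint alone would only rule out norms above 1.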

Upvotes: 0

Views: 1081

Answers (1)

Ayush Bachan

Reputation: 21

There is also the difference-of-expected-values term in the loss function, which will counter gradients close to one in later stages of training. Weight decay could reduce the impact of gradients later in training, but a large gradient on average throughout training trains faster. Since the penalty keeps the L2 norm of the gradient close to 1, it penalizes larger, out-of-range gradients (norm > 2, say) much more than gradients near 1. Also, I can't think of another way to penalize the gradient; centering the penalty around 0 would make learning slow.
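To see the asymmetry numerically, here is a toy calculation (my own illustration, not from the answer) of the two-sided penalty $(\|\nabla\|_2 - 1)^2$ at a few gradient norms:

```python
def penalty(grad_norm, center=1.0):
    """Per-sample two-sided penalty: squared deviation of the gradient norm
    from the target (1 in WGAN-GP)."""
    return (grad_norm - center) ** 2

# Penalty grows quadratically with distance from 1, so an out-of-range
# norm of 4 costs 9x more than a norm of 2, and a norm of 0 costs the
# same as a norm of 2.
for n in [0.0, 0.5, 1.0, 2.0, 4.0]:
    print(f"norm={n}: penalty={penalty(n)}")
```

Centering the penalty around 0 instead of 1 would push gradient norms toward 0, which is why a critic penalized that way learns slowly.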

Upvotes: 2
