SantoshGupta7

Reputation: 6197

Why is gradient clipping not supported with a distribution strategy in TensorFlow?

It looks like gradient clipping is not supported when using a distribution strategy:

https://github.com/tensorflow/tensorflow/blob/f9f6b4cec2a1bdc5781e4896d80cee1336a2fbab/tensorflow/python/keras/optimizer_v2/optimizer_v2.py#L383

("Gradient clipping in the optimizer " "(by setting clipnorm or clipvalue) is currently " "unsupported when using a distribution strategy.")

Is there any reason for this? I am tempted to define a custom def _minimize(strategy, tape, optimizer, loss, trainable_variables) that clips the gradients directly.
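For reference, a minimal sketch of the kind of workaround I have in mind: a custom train step that clips the gradients itself instead of relying on the optimizer's clipnorm/clipvalue. The model, optimizer, and clip value are placeholders, and note that clipping inside strategy.run clips each replica's gradients, i.e. before they are aggregated.

    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()

    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
        optimizer = tf.keras.optimizers.Adam()
        # SUM reduction because the default AUTO reduction is disallowed
        # in custom loops under a distribution strategy.
        loss_fn = tf.keras.losses.MeanSquaredError(
            reduction=tf.keras.losses.Reduction.SUM)

    @tf.function
    def train_step(dist_inputs):
        def step_fn(inputs):
            x, y = inputs
            with tf.GradientTape() as tape:
                loss = loss_fn(y, model(x, training=True))
            grads = tape.gradient(loss, model.trainable_variables)
            # Clip the gradients directly instead of using clipnorm/clipvalue.
            # This clips each replica's gradients, i.e. before aggregation.
            grads, _ = tf.clip_by_global_norm(grads, 1.0)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))
            return loss
        per_replica_loss = strategy.run(step_fn, args=(dist_inputs,))
        return strategy.reduce(
            tf.distribute.ReduceOp.SUM, per_replica_loss, axis=None)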

Upvotes: 5

Views: 887

Answers (1)

vandenheuvel

Reputation: 359

GitHub user tomerk wrote:

There are two possible places to clip when you have distribution strategies enabled:

  • before gradients get aggregated (usually wrong)
  • after gradients get aggregated (usually right & what people expect)

We want it working with the second case (clipping after gradients are aggregated). The issue is that the optimizers are currently written so that clipping happens before aggregation does.

We looked into changing this, but it would have required either:

  • API changes that break existing users of the optimizer's apply_gradients and other non-minimize methods
  • changing the signatures of the methods that optimizer implementers need to implement, breaking existing custom optimizers

So rather than:

  • quietly doing clipping in the wrong place
  • increasing churn and breaking existing users or existing custom optimizers just for this individual feature

We instead decided to leave this disabled for now. We'll roll support for this into a larger optimizer refactoring that solves a broader set of issues.
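If you want the "after aggregation" placement tomerk describes in your own training loop today, one approach is to all-reduce the gradients yourself, clip the aggregated result, and then tell apply_gradients not to aggregate a second time. A hedged sketch, with model, optimizer, loss_fn, and clip_norm as placeholders (experimental_aggregate_gradients is available on the TF 2.x Keras optimizers):

    import tensorflow as tf

    def make_train_step(strategy, model, optimizer, loss_fn, clip_norm=1.0):
        @tf.function
        def train_step(dist_inputs):
            def step_fn(inputs):
                x, y = inputs
                with tf.GradientTape() as tape:
                    loss = loss_fn(y, model(x, training=True))
                grads = tape.gradient(loss, model.trainable_variables)
                # Aggregate (sum) the gradients across replicas first ...
                replica_ctx = tf.distribute.get_replica_context()
                grads = replica_ctx.all_reduce(tf.distribute.ReduceOp.SUM, grads)
                # ... then clip the aggregated gradients, i.e. clipping
                # after aggregation.
                grads, _ = tf.clip_by_global_norm(grads, clip_norm)
                # The gradients are already aggregated, so the optimizer
                # must not aggregate them a second time.
                optimizer.apply_gradients(
                    zip(grads, model.trainable_variables),
                    experimental_aggregate_gradients=False)
                return loss
            per_replica_loss = strategy.run(step_fn, args=(dist_inputs,))
            return strategy.reduce(
                tf.distribute.ReduceOp.SUM, per_replica_loss, axis=None)
        return train_step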

This has now been implemented.
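A minimal usage sketch, assuming a recent TF 2.x release (around 2.4 or later) where the built-in clipping arguments are accepted under a distribution strategy; the model and hyperparameters are placeholders:

    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()
    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
        # clipnorm (or clipvalue) no longer raises an error under a
        # distribution strategy on recent releases.
        optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)
        model.compile(optimizer=optimizer, loss="mse")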

Upvotes: 0
