Zijun Xue

Reputation: 71

In Keras, is there any function similar to the zero_grad() in Pytorch?

In Pytorch, we can call zero_grad() to clear the gradients. In Keras, do we have a similar function so that we can achieve the same thing? For example, I want to accumulate gradients among some batches.

Upvotes: 7

Views: 1559

Answers (2)

Little Train

Reputation: 902

If you are using a custom training loop, it is easy to implement:

...
# a glimpse of your custom training loop;
# assume a `flag` has been defined to control when to apply the update
# and a `buf = []` has been defined to hold the accumulated gradients
with tf.GradientTape() as tape:
    loss = ...
grads = tape.gradient(loss, model.trainable_variables)
if flag:  # stop accumulating and apply the update
    _grads = some_func(buf)  # reduce the accumulated grads in buf
    buf = []  # clear buf
    optimizer.apply_gradients(zip(_grads, model.trainable_variables))
else:  # accumulate grads
    buf.append(grads)
...
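For instance, some_func could simply average the buffered gradients element-wise. A minimal sketch (some_func is only the placeholder name from the snippet above):

def some_func(buf):
    # buf is a list of per-batch gradient lists; zip(*buf) groups the
    # gradients of each variable across batches so they can be averaged
    return [tf.add_n(per_var) / len(buf) for per_var in zip(*buf)]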

As for the high-level Keras API (model.compile(), model.fit()), I have no idea: I use both TF2 and PyTorch, and I prefer custom training loops in both, which is an easier way to narrow the distance between the two.

Upvotes: 1

Giovanni Minelli

Reputation: 126

In PyTorch gradients are accumulated for every variable as the loss value is distributed among them all by backpropagation. The optimizer is then the one in charge of applying the update to the model parameters (specified at its initialization), and since the accumulated values are kept in memory between iterations, you have to zero them at the start of each one.

optimizer = torch.optim.Adam(itertools.chain(*param_list), lr=opt.lr, ...)
...
optimizer.zero_grad()  # reset the gradients accumulated in the previous iteration
loss = ...
loss.backward()        # accumulate gradients into each parameter's .grad
optimizer.step()       # apply the update using the current .grad values
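This also means that to accumulate gradients over several batches in PyTorch you simply skip zero_grad() until you are ready to step. A minimal sketch, assuming model, loss_fn, optimizer and loader already exist (accum_steps is an illustrative choice):

accum_steps = 4
optimizer.zero_grad()  # start from clean gradients
for step, (x, y) in enumerate(loader, start=1):
    loss = loss_fn(model(x), y) / accum_steps  # scale so the sum averages out
    loss.backward()                            # grads keep summing into .grad
    if step % accum_steps == 0:
        optimizer.step()       # one update for the whole accumulated window
        optimizer.zero_grad()  # reset before the next accumulation window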

In Keras with gradient tapes you wrap the bunch of operations whose variables you want gradients for. You then call the gradient method on the tape, passing it the loss value and the variables, to compute the gradient update for each of them. The optimizer applies all the updates in a single call (for the entire list of update-parameter pairs you pass it).

with tf.GradientTape() as tape:
    loss = ...  # forward pass recorded by the tape
grads = tape.gradient(loss, model.trainable_variables)  # one gradient per variable
optimizer.apply_gradients(zip(grads, model.trainable_variables))  # single update call

You can use the .fit() method instead, which does all of that under the hood.

If your aim is to accumulate the update over multiple batches, there is no standard method in Keras, but you can do it quite easily with tapes by accumulating the gradient values before applying them (see https://www.tensorflow.org/api_docs/python/tf/GradientTape#:~:text=To%20compute%20multiple%20gradients%20over%20the%20same%20computation).
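A minimal sketch of that accumulation, assuming model, loss_fn, optimizer and dataset already exist (accum_steps is an illustrative choice):

accum_steps = 4
accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]
for step, (x, y) in enumerate(dataset, start=1):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    accum_grads = [a + g for a, g in zip(accum_grads, grads)]
    if step % accum_steps == 0:
        mean_grads = [a / accum_steps for a in accum_grads]  # average the window
        optimizer.apply_gradients(zip(mean_grads, model.trainable_variables))
        accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]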

A good solution to do it with .fit() is explained here: How to accumulate gradients for large batch sizes in Keras
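The idea there is roughly to subclass tf.keras.Model and override train_step so the accumulated gradients are only applied every few batches. A rough sketch of that pattern, not the linked answer's exact code (AccumModel and accum_steps are illustrative names, it assumes TF 2.x where train_step can be overridden, and it is compiled with run_eagerly=True so the plain Python control flow and state below stay valid):

import tensorflow as tf

class AccumModel(tf.keras.Model):
    def __init__(self, *args, accum_steps=4, **kwargs):
        super().__init__(*args, **kwargs)
        self.accum_steps = accum_steps
        self._step = 0
        self._acc = None  # gradient buffers, created on the first batch

    def train_step(self, data):
        x, y = data
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.compiled_loss(y, y_pred)
        grads = tape.gradient(loss, self.trainable_variables)
        if self._acc is None:
            self._acc = [tf.zeros_like(g) for g in grads]
        self._acc = [a + g for a, g in zip(self._acc, grads)]
        self._step += 1
        if self._step % self.accum_steps == 0:
            mean = [a / self.accum_steps for a in self._acc]  # average the window
            self.optimizer.apply_gradients(zip(mean, self.trainable_variables))
            self._acc = [tf.zeros_like(a) for a in self._acc]
        self.compiled_metrics.update_state(y, y_pred)
        return {m.name: m.result() for m in self.metrics}

# model = AccumModel(inputs, outputs, accum_steps=4)
# model.compile(optimizer="adam", loss="mse", run_eagerly=True)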

If you want to know more about how parameter gradients are tracked efficiently in order to distribute the loss value, and to understand the whole process better, have a look at (Wikipedia) Automatic differentiation.

Upvotes: 2
