Reputation: 1940
I am trying to perform a manual gradient accumulation in Tensorflow 2.2.0 using Python 3.8.5.
I have a piece of code where I gather a series of gradients (grads_total_number of them in total) in a list, grads_list, and I do the accumulation using the following code:
avg_accum_grads = []
for grad_ind in range(grads_total_number):
    avg_accum_grads.append(tf.reduce_mean(grads_list[grad_ind], axis=0))
I then intend to apply these gradients to my model via my optimizer:
myopt.apply_gradients(zip(avg_accum_grads, model.trainable_variables), experimental_aggregate_gradients=True)
where my optimizer is Adam, defined as tf.keras.optimizers.Adam.
However, after reading the documentation here, I am confused about whether I have to set experimental_aggregate_gradients to False. What I could not clearly understand is: given that I have done the accumulation manually, if I leave it as True, will the gradients be accumulated again?
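For context, the way I build grads_list looks roughly like this (a sketch with placeholder names: accumulation_batches and compute_loss stand in for my actual data pipeline and loss, and grads_total_number equals len(model.trainable_variables)):

# Collect the per-variable gradients of several mini-batches.
grads_per_step = []
for x_batch, y_batch in accumulation_batches:
    with tf.GradientTape() as tape:
        loss = compute_loss(model(x_batch, training=True), y_batch)
    grads_per_step.append(tape.gradient(loss, model.trainable_variables))

# grads_list[i] stacks the gradients of trainable variable i across the
# accumulation steps along axis 0, so reduce_mean(..., axis=0) averages
# over the steps.
grads_list = [tf.stack([step_grads[i] for step_grads in grads_per_step], axis=0)
              for i in range(grads_total_number)]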
Any help is much appreciated.
Upvotes: 4
Views: 533
Reputation:
The experimental_aggregate_gradients parameter only comes into the picture if you are using a distribution strategy, i.e. distributed training across multiple GPUs, multiple machines, or TPUs. Under a distribution strategy, the gradients are computed on each replica and then aggregated across the replicas by summing them. This aggregation is handled automatically if experimental_aggregate_gradients = True; if experimental_aggregate_gradients = False, you have to take care of it manually. If you are not using a distribution strategy, it makes no difference whether you set experimental_aggregate_gradients to True or False.
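For example, if you did want to handle the aggregation yourself under a strategy, the pattern would look roughly like this (a minimal sketch; build_model, loss_fn, and GLOBAL_BATCH_SIZE are placeholders):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = build_model()              # placeholder model factory
    optimizer = tf.keras.optimizers.Adam()

@tf.function
def train_step(x, y):
    def step_fn(x, y):
        with tf.GradientTape() as tape:
            per_example_loss = loss_fn(y, model(x, training=True))
            loss = tf.nn.compute_average_loss(
                per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)
        grads = tape.gradient(loss, model.trainable_variables)
        # Sum the gradients across replicas ourselves ...
        grads = tf.distribute.get_replica_context().all_reduce(
            tf.distribute.ReduceOp.SUM, grads)
        # ... and tell the optimizer not to aggregate them a second time.
        optimizer.apply_gradients(
            zip(grads, model.trainable_variables),
            experimental_aggregate_gradients=False)
    strategy.run(step_fn, args=(x, y))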
If you look into the documentation of tf.distribute.Strategy, you will find the following attribute:
num_replicas_in_sync - Returns the number of replicas over which gradients are aggregated.
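For instance, you can check how many replicas would take part in the aggregation like this (a quick illustration; the count depends on the devices visible to TensorFlow):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
# Prints the number of replicas, e.g. the number of GPUs found.
print(strategy.num_replicas_in_sync)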
To summarize, if you are using tf.distribute.Strategy and
- experimental_aggregate_gradients = True, then the gradients computed on the different replicas are aggregated automatically;
- experimental_aggregate_gradients = False, then it is the user's responsibility to aggregate the gradients from the different replicas.
By default, experimental_aggregate_gradients is set to True. It makes no difference whether it is True or False unless you are using tf.distribute.Strategy.
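Applied to your case: since you are not running under a tf.distribute.Strategy, leaving the flag at its default will not accumulate or aggregate anything further, so your manually averaged gradients are applied exactly once (a sketch using your variable names):

# No distribution strategy: the flag has no effect, so the default is fine.
myopt.apply_gradients(zip(avg_accum_grads, model.trainable_variables))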
Upvotes: 1