Reputation: 1940
I am trying to perform a manual gradient accumulation in Tensorflow 2.2.0 using Python 3.8.5.
I have a piece of code where I gather a series of gradients (grads_total_number of them in total) in a list, grads_list, and I do the accumulation using the following code:
avg_accum_grads = []
for grad_ind in range(grads_total_number):
    avg_accum_grads.append(tf.reduce_mean(grads_list[grad_ind], axis=0))
I then intend to apply these gradients to my model via my optimizer:
myopt.apply_gradients(zip(avg_accum_grads, model.trainable_variables), experimental_aggregate_gradients=True)
where my optimizer is Adam, defined as tf.keras.optimizers.Adam.
However, after reading the documentation here, I am confused about whether I have to set experimental_aggregate_gradients to False. What I could not clearly understand is: given that I have done the accumulation manually, if I leave it as True, will the gradients be accumulated again?
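For context, the way I build grads_list looks roughly like this (a sketch with placeholder names: accumulation_batches and compute_loss stand in for my actual data pipeline and loss, and grads_total_number equals len(model.trainable_variables)):

# Collect the per-variable gradients of several mini-batches.
grads_per_step = []
for x_batch, y_batch in accumulation_batches:
    with tf.GradientTape() as tape:
        loss = compute_loss(model(x_batch, training=True), y_batch)
    grads_per_step.append(tape.gradient(loss, model.trainable_variables))

# grads_list[i] stacks the gradients of trainable variable i across the
# accumulation steps along axis 0, so reduce_mean(..., axis=0) averages
# over the steps.
grads_list = [tf.stack([step_grads[i] for step_grads in grads_per_step], axis=0)
              for i in range(grads_total_number)]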
Any help is much appreciated.
Upvotes: 4
Views: 533
Reputation:
The experimental_aggregate_gradients parameter only comes into the picture if you are using a distribution strategy, i.e. distributed training across multiple GPUs, multiple machines, or TPUs. Under a distribution strategy, the gradients are computed on each replica and then aggregated across the replicas by summing them. This aggregation is handled automatically if experimental_aggregate_gradients = True; if experimental_aggregate_gradients = False, you have to take care of it manually. If you are not using a distribution strategy, it makes no difference whether you set experimental_aggregate_gradients to True or False.
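For example, if you did want to handle the aggregation yourself under a strategy, the pattern would look roughly like this (a minimal sketch; build_model, loss_fn, and GLOBAL_BATCH_SIZE are placeholders):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = build_model()              # placeholder model factory
    optimizer = tf.keras.optimizers.Adam()

@tf.function
def train_step(x, y):
    def step_fn(x, y):
        with tf.GradientTape() as tape:
            per_example_loss = loss_fn(y, model(x, training=True))
            loss = tf.nn.compute_average_loss(
                per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)
        grads = tape.gradient(loss, model.trainable_variables)
        # Sum the gradients across replicas ourselves ...
        grads = tf.distribute.get_replica_context().all_reduce(
            tf.distribute.ReduceOp.SUM, grads)
        # ... and tell the optimizer not to aggregate them a second time.
        optimizer.apply_gradients(
            zip(grads, model.trainable_variables),
            experimental_aggregate_gradients=False)
    strategy.run(step_fn, args=(x, y))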
If you look into the documentation of tf.distribute.Strategy, you will find the following attribute:
num_replicas_in_sync - Returns the number of replicas over which gradients are aggregated.
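For instance, you can check how many replicas would take part in the aggregation like this (a quick illustration; the count depends on the devices visible to TensorFlow):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
# Prints the number of replicas, e.g. the number of GPUs found.
print(strategy.num_replicas_in_sync)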
To summarize, if you are using tf.distribute.Strategy and
- experimental_aggregate_gradients = True, then the gradients computed on the different replicas are aggregated automatically;
- experimental_aggregate_gradients = False, then it is the user's responsibility to aggregate the gradients from the different replicas.
By default, experimental_aggregate_gradients is set to True. It makes no difference whether it is True or False unless you are using tf.distribute.Strategy.
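Applied to your case: since you are not running under a tf.distribute.Strategy, leaving the flag at its default will not accumulate or aggregate anything further, so your manually averaged gradients are applied exactly once (a sketch using your variable names):

# No distribution strategy: the flag has no effect, so the default is fine.
myopt.apply_gradients(zip(avg_accum_grads, model.trainable_variables))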
Upvotes: 1