Reputation: 255
I am trying to go through TensorFlow's Inception code for multiple GPUs (on one machine). I am confused because, as I understand it, we get multiple losses from the different towers (i.e. the GPUs), yet the loss variable that gets evaluated seems to be only that of the last tower, not a sum of the losses from all towers:
for step in xrange(FLAGS.max_steps):
  start_time = time.time()
  _, loss_value = sess.run([train_op, loss])
  duration = time.time() - start_time
Where loss was last defined specifically for each tower:
for i in xrange(FLAGS.num_gpus):
  with tf.device('/gpu:%d' % i):
    with tf.name_scope('%s_%d' % (inception.TOWER_NAME, i)) as scope:
      # Force all Variables to reside on the CPU.
      with slim.arg_scope([slim.variables.variable], device='/cpu:0'):
        # Calculate the loss for one tower of the ImageNet model. This
        # function constructs the entire ImageNet model but shares the
        # variables across all towers.
        loss = _tower_loss(images_splits[i], labels_splits[i], num_classes,
                           scope)
Could someone explain where the step is that combines the losses from the different towers? Or are we simply using a single tower's loss as representative of the other towers' losses as well?
Here's the link to the code: https://github.com/tensorflow/models/blob/master/inception/inception/inception_train.py#L336
Upvotes: 2
Views: 919
Reputation: 161
Yes, according to this code, losses are not summed or averaged across GPUs. Each per-GPU loss is used inside its own GPU (tower) for the gradient calculation; only the gradients are synchronized. So the isnan check is only applied to the portion of data processed by the last GPU. This is not crucial, but it can be a limitation.
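For context, the gradient synchronization mentioned above happens where the per-tower (gradient, variable) pairs are averaged variable by variable before the optimizer applies them. Below is a minimal sketch of that step, modeled on the _average_gradients helper in the linked inception_train.py (the function name and exact details here are illustrative, not a verbatim copy of the source):

import tensorflow as tf

def average_tower_gradients(tower_grads):
  # tower_grads: a list over towers, where each element is the list of
  # (gradient, variable) pairs returned by opt.compute_gradients() in that tower.
  averaged = []
  for grad_and_vars in zip(*tower_grads):
    # grad_and_vars groups the same variable across all towers:
    # ((grad_gpu0, var), (grad_gpu1, var), ...)
    grads = [tf.expand_dims(g, 0) for g, _ in grad_and_vars]
    mean_grad = tf.reduce_mean(tf.concat(grads, axis=0), axis=0)
    # The variable itself is shared across towers, so take it from the first tower.
    averaged.append((mean_grad, grad_and_vars[0][1]))
  return averaged

The averaged pairs are what get passed to opt.apply_gradients(...), which is why training uses data from every tower even though only one tower's loss is fetched.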
If you really need it, I think you can do the following to get the loss averaged across GPUs:
per_gpu_loss = []
for i in xrange(FLAGS.num_gpus):
  with tf.device('/gpu:%d' % i):
    with tf.name_scope('%s_%d' % (inception.TOWER_NAME, i)) as scope:
      ...
      per_gpu_loss.append(loss)

mean_loss = tf.reduce_mean(per_gpu_loss, name="mean_loss")
tf.summary.scalar('mean_loss', mean_loss)
and then replace loss with mean_loss in the sess.run call:
_, loss_value = sess.run([train_op, mean_loss])
loss_value is now the average of the losses computed by all the GPUs.
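If you want the sum of the losses rather than the mean (as phrased in the question), you could reduce the same list with tf.add_n instead; this is just an alternative reduction, not something the original code does:

total_loss = tf.add_n(per_gpu_loss, name="total_loss")
_, loss_value = sess.run([train_op, total_loss])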
Upvotes: 1
Reputation: 1065
For monitoring purposes, assuming all towers work as expected, a single tower's loss is as representative as the average of all towers' losses. This is because there is no relationship between a batch and the tower it is assigned to.
But the train_op uses the gradients from all towers (see lines 263 and 278 of the linked file), so technically training takes the batches from all towers into account, as it should.
Note that the average of the losses will have lower variance than a single tower's loss, but they will have the same expectation.
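A quick way to see this: if each tower's loss is an independent draw from the same distribution, the mean over num_gpus towers keeps the same expectation while its variance drops by roughly a factor of num_gpus. A small, purely illustrative numpy sketch (not part of the Inception code):

import numpy as np

rng = np.random.RandomState(0)
num_gpus, steps = 4, 10000

# Simulate per-tower losses as i.i.d. draws around the same underlying loss.
tower_losses = rng.normal(loc=2.0, scale=0.5, size=(steps, num_gpus))

single_tower = tower_losses[:, -1]    # what the original code reports (last tower)
averaged = tower_losses.mean(axis=1)  # mean across towers

print("mean (single vs averaged):", single_tower.mean(), averaged.mean())
print("var  (single vs averaged):", single_tower.var(), averaged.var())
# Expect: nearly equal means; averaged variance is about single variance / num_gpus.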
Upvotes: 1