deef

Reputation: 4760

Tensorflow CIFAR10 Multi GPU - Why Combined Loss?

In the multi-GPU TensorFlow CIFAR10 example, the losses seem to be combined for each "tower", and the gradient is calculated from this combined loss.

    # Build the portion of the Graph calculating the losses. Note that we will
    # assemble the total_loss using a custom function below.
    _ = cifar10.loss(logits, labels)

    # Assemble all of the losses for the current tower only.
    losses = tf.get_collection('losses', scope)

    # Calculate the total loss for the current tower.
    total_loss = tf.add_n(losses, name='total_loss')

    # Attach a scalar summary to all individual losses and the total loss; do the
    # same for the averaged version of the losses.
    for l in losses + [total_loss]:
        # Remove 'tower_[0-9]/' from the name in case this is a multi-GPU training
        # session. This helps the clarity of presentation on tensorboard.
        loss_name = re.sub('%s_[0-9]*/' % cifar10.TOWER_NAME, '', l.op.name)
        tf.contrib.deprecated.scalar_summary(loss_name, l)

    return total_loss

I'm new to TensorFlow, but from my understanding, every time cifar10.loss is called, tf.add_to_collection('losses', cross_entropy_mean) is run and the loss from the current batch is stored in the collection.

Then losses = tf.get_collection('losses', scope) is called, and all the losses are retrieved from the collection. The tf.add_n op then adds all the retrieved loss tensors from this "tower" together.
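
As a concrete illustration of that mechanism, here is a minimal sketch (TF 1.x API; the two towers and the constant loss values are hypothetical stand-ins for cross_entropy_mean and the weight-decay terms):

    import tensorflow as tf

    TOWER_NAME = 'tower'

    for i in range(2):
        with tf.name_scope('%s_%d' % (TOWER_NAME, i)) as scope:
            # Hypothetical per-tower loss tensors standing in for
            # cross_entropy_mean and the weight-decay losses.
            cross_entropy_mean = tf.constant(0.5, name='cross_entropy')
            weight_decay = tf.constant(0.1, name='weight_loss')
            tf.add_to_collection('losses', cross_entropy_mean)
            tf.add_to_collection('losses', weight_decay)

            # The scope argument filters the collection, so only the tensors
            # created under this tower's name scope come back -- not losses
            # from other towers, and not losses from earlier batches (the
            # collection holds graph tensors, not per-batch values).
            losses = tf.get_collection('losses', scope)
            total_loss = tf.add_n(losses, name='total_loss')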

I expected the loss to be just from the current training step/batch, not all batches.

Am I misunderstanding something? Or is there a reason for combining the losses together?

Upvotes: 4

Views: 879

Answers (2)

Ishant Mrinal

Reputation: 4918

Why combined loss

The example you are referring to is an example of data parallelism over multiple GPUs. Data parallelism helps with training deeper models with a bigger batch_size. In this setting you need to combine the losses from the GPUs, as each GPU holds one part of the input batch (and the loss and gradients corresponding to that part). An illustration is provided in the figure below, from the TensorFlow data parallelism example.

Note: in the case of model parallelism, different subgraphs of the model run on separate GPUs, and the intermediate outputs are collected by the master.

Example

Suppose you want to train with a batch size of 256 for a deeper model (for example, ResNet or Inception) that may not fit on a single GPU (for example, one with 8 GB of memory). You can split the batch into two batches of size 128, do the forward pass of the model with the two batches on separate GPUs, and compute the loss and gradients on each. The computed (loss, gradients) from each of the GPUs are collected and averaged, and the averaged gradients are used to update the model parameters.

[Figure: illustration of data parallelism across multiple GPUs]
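
A minimal sketch of this pattern in the same TF 1.x style as the CIFAR10 example (the single-layer model, placeholder shapes, and two-GPU split are hypothetical stand-ins, not the tutorial's actual code):

    import tensorflow as tf

    def build_model_loss(images, labels):
        # Hypothetical single-layer model, just to keep the sketch self-contained.
        w = tf.get_variable('w', [3072, 10])
        b = tf.get_variable('b', [10], initializer=tf.zeros_initializer())
        logits = tf.matmul(images, w) + b
        return tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels,
                                                           logits=logits))

    optimizer = tf.train.GradientDescentOptimizer(0.1)
    images = tf.placeholder(tf.float32, [256, 3072])  # global batch of 256
    labels = tf.placeholder(tf.int64, [256])

    image_shards = tf.split(images, 2)                # two shards of 128 each
    label_shards = tf.split(labels, 2)

    tower_grads = []
    with tf.variable_scope(tf.get_variable_scope()):
        for i in range(2):
            with tf.device('/gpu:%d' % i), tf.name_scope('tower_%d' % i):
                # Each GPU does a forward/backward pass on its own shard.
                loss = build_model_loss(image_shards[i], label_shards[i])
                # All towers share the same model variables.
                tf.get_variable_scope().reuse_variables()
                tower_grads.append(optimizer.compute_gradients(loss))

    # Average the per-shard gradients variable by variable, then apply one update.
    # (Actually running this needs GPUs, or allow_soft_placement=True in the
    # session config.)
    avg_grads = [(tf.add_n([g for g, _ in gv]) / 2.0, gv[0][1])
                 for gv in zip(*tower_grads)]
    train_op = optimizer.apply_gradients(avg_grads)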

Upvotes: 1

Sara

Reputation: 21

If weight decay is enabled, the weight-decay terms are also added to the 'losses' collection. Therefore, for each tower (scope), tf.add_n sums all of the losses: cross_entropy_mean plus the weight-decay losses.

Gradients are then calculated for each tower (scope). At the end, the gradients from the different towers (scopes) are averaged in average_gradients.
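
For reference, a sketch of what such an averaging helper looks like, close in spirit to the average_gradients function in the multi-GPU tutorial (tower_grads is the list of per-tower (gradient, variable) lists returned by compute_gradients):

    import tensorflow as tf

    def average_gradients(tower_grads):
        """Average gradients across towers, variable by variable."""
        average_grads = []
        # zip(*tower_grads) groups the (gradient, variable) pairs so that each
        # group holds the gradients of one variable from every tower.
        for grad_and_vars in zip(*tower_grads):
            grads = [tf.expand_dims(g, 0) for g, _ in grad_and_vars]
            grad = tf.reduce_mean(tf.concat(grads, axis=0), 0)
            # The variable itself is shared across towers, so any tower's
            # reference to it will do.
            average_grads.append((grad, grad_and_vars[0][1]))
        return average_grads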

Upvotes: 1
