Ankit Bindal

Reputation: 1529

Tensorflow: Multiple loss functions vs Multiple training ops

I am creating a Tensorflow model which predicts multiple outputs (with different activations). I think there are two ways to do this:

Method 1: Create multiple loss functions (one for each output), merge them (using tf.reduce_mean or tf.reduce_sum) and pass the result to the training op like so:

final_loss = tf.reduce_mean(loss1 + loss2)
train_op = tf.train.AdamOptimizer().minimize(final_loss)

Method 2: Create multiple training operations and then group them like so:

train_op1 = tf.train.AdamOptimizer().minimize(loss1)
train_op2 = tf.train.AdamOptimizer().minimize(loss2)
final_train_op = tf.group(train_op1, train_op2)

My question is whether one method is advantageous over the other? Is there a third method I don't know?

Thanks

Upvotes: 33

Views: 22820

Answers (5)

markus-hinsche

Reputation: 1422

I will showcase how to implement a multi-output regression model using TensorFlow's functional API.

In multi-task learning, we need a base network that is shared between tasks and a network head for each individual task:

from tensorflow.keras import layers, models, Model

def create_base_cnn(input_shape):
    model = models.Sequential()
    model.add(layers.Conv2D(filters=32, kernel_size=(3, 3), padding="same", activation="relu", input_shape=input_shape))
    model.add(layers.Conv2D(filters=32, kernel_size=(3, 3), padding="same", activation="relu"))
    # put more layers if you like
    model.add(layers.GlobalAveragePooling2D())  # collapse spatial dims so the Dense layer outputs shape (128,)
    model.add(layers.Dense(128, activation="relu"))
    return model

def create_head(input_shape, name):
    model = models.Sequential(name=name)
    model.add(layers.Dense(128, activation="relu", input_shape=input_shape))
    model.add(layers.Dense(64, activation="relu"))
    # put more layers if you like
    model.add(layers.Dense(1, activation="linear"))
    return model

We can now combine the base model with the heads.

# Create the model.
input_shape = (240, 180, 1)
base_model = create_base_cnn(input_shape)
head_model1 = create_head((128,), name="head1")
head_model2 = create_head((128,), name="head2")
model_input = layers.Input(shape=input_shape)

# Combine base with heads (using TF's functional API)
features = base_model(model_input)
model_output1 = head_model1(features)
model_output2 = head_model2(features)
model = Model(inputs=model_input, outputs=[model_output1, model_output2])

Finally to train the model we can refer to the different outputs by name (in my case: "head1" and "head2"). We can define a hyperparameter for the weight of each head in the loss function:

HEAD1_WEIGHT = 0.4
HEAD2_WEIGHT = 0.6
model.compile(
    optimizer="Adam",
    loss={"head1": "mse", "head2": "mse"},
    loss_weights={"head1": HEAD1_WEIGHT, "head2": HEAD2_WEIGHT},
    metrics={"head1": ["mae"], "head2": ["mae"]}
)
model.fit(dataset_training, validation_data=validation_data, epochs=epochs)

Upvotes: 0

Sam Bobel

Reputation: 1824

I want to make a subtle point that I don't think was made in previous answers.

If you were using something like GradientDescentOptimizer, these would be very similar operations. That's because taking gradients is a linear operation, and the gradient of a sum is the same as the sum of the gradients.

But ADAM does something special: regardless of the scale of your loss, it scales the gradients so that they're always on the order of your learning rate. If you multiplied your loss by 1000, it wouldn't affect ADAM, because the change would be normalized away.

So, if your two losses are roughly the same magnitude, then it shouldn't make a difference. If one is much larger than the other, then keep in mind that summing before the minimization will essentially ignore the small one, while making two ops will spend equal effort minimizing both.
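Adam's scale invariance is easy to see from its update rule. Below is a minimal plain-Python sketch of a single Adam step (first iteration, zero-initialized moment estimates; illustrative only, not TensorFlow's exact implementation):

```python
def adam_step(grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, t=1):
    # First and second moment estimates, starting from zero.
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad ** 2
    # Bias correction for the zero initialization.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # The update is roughly lr * sign(grad): the magnitude cancels out.
    return lr * m_hat / (v_hat ** 0.5 + eps)

# A gradient 1000x larger produces essentially the same step size.
print(adam_step(0.01), adam_step(10.0))
```

Both calls return a step of roughly the learning rate (0.001): the gradient magnitude is normalized away by the second-moment estimate, which is exactly why summing losses of very different scales behaves differently under Adam than under plain gradient descent.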

I personally like dividing them up, which gives you more control over how much to focus on one loss or the other. For example, if it was multi-task learning, and one task was more important to get right than the other, two ops with different learning rates roughly accomplishes this.
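Another way to get that control with a single training op is to weight the losses before summing. A toy plain-Python sketch with a shared scalar parameter `w` and made-up quadratic losses (the names `alpha`, `g1`, `g2` are mine, not from the question):

```python
# Hypothetical per-task gradients: loss1 = (w - 1)^2, loss2 = (w + 1)^2.
def g1(w): return 2 * (w - 1)
def g2(w): return 2 * (w + 1)

w, lr = 0.0, 0.1
alpha = 0.9  # hypothetical weight favoring task 1

# One gradient-descent step on the weighted sum of the losses.
grad = alpha * g1(w) + (1 - alpha) * g2(w)
w_new = w - lr * grad
print(w_new)
```

Here the step moves `w` toward task 1's optimum (w = 1) far more than toward task 2's, which is the kind of per-task emphasis the answer describes.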

Upvotes: 27

khuang834

Reputation: 971

The difference between the two methods is demonstrated clearly in this post on multi-task learning in tensorflow.

In short:

Method 1: This is called joint training. Since it directly adds the losses together, all the gradients and updates are computed with respect to both losses at the same time. Generally this is used when training multiple outputs from the same set of input features.

Method 2: This creates two separate optimizers and is called alternate training. This is used when you use a subset of input features for each of the outputs. Therefore, when feeding in the feature subset for train_op1, the sub-graph for train_op2 is untouched. Each optimizer can be called in an alternating order using different input features.

If you run both optimizers concurrently with the same input data, then the difference from Method 1 is probably very minor.
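The distinction can be sketched in plain Python (using vanilla gradient descent rather than Adam, and made-up quadratic losses): joint training takes one step on the summed gradient, while alternate training takes sequential per-loss steps, each seeing the parameter value left by the previous one:

```python
# Hypothetical per-task gradients on a shared parameter w:
# loss1 = (w - 1)^2, loss2 = (w + 1)^2.
def g1(w): return 2 * (w - 1)
def g2(w): return 2 * (w + 1)

lr = 0.1

# Joint training: one step on the gradient of the summed loss.
w = 0.5
w_joint = w - lr * (g1(w) + g2(w))

# Alternate training: sequential steps, one per loss; the second step
# sees the parameter value already updated by the first.
w = 0.5
w = w - lr * g1(w)
w_alt = w - lr * g2(w)

print(w_joint, w_alt)  # close, but not identical
```

With the same input data and small steps the two land near each other, which matches the point above: the real payoff of alternate training is being able to feed each optimizer a different feature subset.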

Upvotes: 23

user1454804

Reputation: 1080

Both of the methods you suggested are correct. The difference is quite subtle: in the second solution, AdamOptimizer keeps separate gradient accumulators for each loss. Which one works better requires experimentation.

Upvotes: 2

nessuno

Reputation: 27050

Method 1 is the correct one, because you define the gradient graph (for computing the backpropagation) only once. In this way, you use a single loss function with a single graph to perform a single update of the same parameters (the update takes into account both terms of the loss).

The second method, instead, defines two different graphs for computing the gradients, and is wrong. When you execute the training op, the two training operations run in parallel (because you used tf.group / tf.tuple / tf.control_dependencies).

The operations will compute two different losses and two different sets of updated variables.

When the moment of updating the variables comes, you have a problem: which update operation executes first, the one defined by the first graph or the other? In either case, you discard one computation, because one will overwrite the other. There is no synchronization between the updates and no relation between the computed losses.

Upvotes: 6
