Reputation: 15
I am a deep learning beginner of about three months, doing small NLP projects with PyTorch.
Recently I have been trying to reproduce a GAN introduced in a paper, using my own text data, to generate some specific kinds of question sentences.
Here is some background... If you have no time or interest in it, it is fine to just read the question below.
As the paper describes, the generator is first trained normally on ordinary question data so that its output at least looks like a real question. Then, using an auxiliary classifier's result (from classifying the outputs), the generator is trained again to generate only the specific questions (a few particular categories).
However, as the paper does not release its code, I have to write it all myself. I have come up with the three training schemes below, but I do not know how they differ. Could you kindly explain?
If they have almost the same effect, could you tell me which is more idiomatic in PyTorch? Thank you very much!
Suppose the discriminator's loss for the generator is loss_G_D, the classifier's loss for the generator is loss_G_C, and loss_G_D and loss_G_C have the same shape, i.e. [batch_size, loss value]. What is the difference between the following?
1.
optimizer.zero_grad()
loss_G_D = loss_func1(discriminator(generated_data))
loss_G_C = loss_func2(classifier(generated_data))
loss = loss_G_D + loss_G_C  # sum the two losses
loss.backward()             # one backward pass through the summed loss
optimizer.step()            # one parameter update
2.
optimizer.zero_grad()
loss_G_D = loss_func1(discriminator(generated_data))
loss_G_D.backward()  # gradients from the discriminator loss
loss_G_C = loss_func2(classifier(generated_data))
loss_G_C.backward()  # gradients from the classifier loss accumulate on top
optimizer.step()     # one update using the accumulated gradients
3.
optimizer.zero_grad()
loss_G_D = loss_func1(discriminator(generated_data))
loss_G_D.backward()
optimizer.step()     # first update, from the discriminator loss only

optimizer.zero_grad()
loss_G_C = loss_func2(classifier(generated_data))
loss_G_C.backward()
optimizer.step()     # second update, from the classifier loss only
Additional info: I observed that the classifier's loss is always very large compared with the generator's loss, e.g. -300 vs 3. So maybe the third one is better?
Upvotes: 0
Views: 207
Reputation: 987
First of all:
loss.backward()
backpropagates the error and assigns a gradient for every parameter along the way that has requires_grad=True.
optimizer.step()
updates the model parameters using their stored gradients.
optimizer.zero_grad()
sets the gradients to 0, so that you can backpropagate your loss and update your model parameters for each batch without interfering with other batches.
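To make those three calls concrete, here is a minimal, self-contained sketch of one training step on a toy linear model (hypothetical names, not the question's GAN):
import torch
import torch.nn as nn

# toy setup (hypothetical, not the question's networks): a linear model on random data
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()
data, target = torch.randn(8, 4), torch.randn(8, 1)

optimizer.zero_grad()                  # clear gradients left over from the previous batch
loss = criterion(model(data), target)  # forward pass and loss
loss.backward()                        # fill .grad for every parameter with requires_grad=True
optimizer.step()                       # update the parameters from those stored gradients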
1 and 2 are quite similar, but if your model uses batch statistics or you have an adaptive optimizer, they will probably perform differently. However, if your model doesn't use batch statistics and you use a plain old SGD optimizer, they will produce the same result, even though 1 would be faster since you do the backprop only once.
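If you want to convince yourself of that equivalence, you can compare the accumulated gradients of the two variants directly; this is a toy sketch with a hypothetical linear model, not the question's generator/discriminator/classifier:
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)                 # toy stand-in for the generator
x = torch.randn(8, 4)

# variant 1: sum the two losses, call backward once
model.zero_grad()
out = model(x)
loss = out.mean() + (out ** 2).mean()
loss.backward()
grad_sum = model.weight.grad.clone()

# variant 2: call backward on each loss separately; gradients accumulate in .grad
model.zero_grad()
out = model(x)
out.mean().backward(retain_graph=True)  # keep the shared graph for the second backward
(out ** 2).mean().backward()
grad_sep = model.weight.grad.clone()

print(torch.allclose(grad_sum, grad_sep))  # True: same accumulated gradients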
3 is a completely different case, since you update your model parameters with loss_G_D.backward() and optimizer.step() before processing and backpropagating loss_G_C.
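In other words, with 3 the second backward/step pair already sees the parameters moved by the first one. A toy sketch of that two-updates-per-batch structure (again a hypothetical linear model, not the question's networks):
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(8, 4)

# first update: only the first loss
optimizer.zero_grad()
loss_a = model(x).mean()
loss_a.backward()
optimizer.step()                   # parameters change here

# second update: the second loss is computed with the already-updated parameters
optimizer.zero_grad()
loss_b = (model(x) ** 2).mean()    # forward pass re-run after the first step
loss_b.backward()
optimizer.step()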
Given all of this, which one to choose is up to you, depending on your application.
Upvotes: 0