Reputation: 9869
I want to accumulate the gradients before I do a backward pass, so I'm wondering what the right way of doing it is. According to this article, it's:
model.zero_grad()                                   # Reset gradients tensors
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                     # Forward pass
    loss = loss_function(predictions, labels)       # Compute loss function
    loss = loss / accumulation_steps                # Normalize our loss (if averaged)
    loss.backward()                                 # Backward pass
    if (i + 1) % accumulation_steps == 0:           # Wait for several backward steps
        optimizer.step()                            # Now we can do an optimizer step
        model.zero_grad()                           # Reset gradients tensors
whereas I expected it to be:
model.zero_grad()                                   # Reset gradients tensors
loss = 0
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                     # Forward pass
    loss += loss_function(predictions, labels)      # Compute loss function
    if (i + 1) % accumulation_steps == 0:           # Wait for several backward steps
        loss = loss / accumulation_steps            # Normalize our loss (if averaged)
        loss.backward()                             # Backward pass
        optimizer.step()                            # Now we can do an optimizer step
        model.zero_grad()                           # Reset gradients tensors
        loss = 0
where I accumulate the loss and then divide by accumulation_steps to average it.
As a secondary question: if I am right, would you expect my method to be quicker, considering I only do the backward pass once every accumulation_steps iterations?
Upvotes: 9
Views: 3024
Reputation: 9869
So according to the answer here, the first method is the memory-efficient one, and the amount of computation required is more or less the same in both methods.
The second method keeps accumulating the computation graph, so it would require roughly accumulation_steps times more memory. The first method computes the gradients straight away after each batch (and simply adds them into the parameters' .grad buffers), so it requires less memory.
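For anyone who wants to convince themselves, here is a minimal sketch (using a hypothetical tiny nn.Linear model and random batches, not the training_set from the question) that runs both variants and checks that they end up with the same gradients; the only difference is that the first variant frees each batch's graph right after its backward call:

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
loss_function = nn.MSELoss()
accumulation_steps = 4
batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(accumulation_steps)]

# Method 1: backward per batch; each graph is freed as soon as it has been used
model.zero_grad()
for inputs, labels in batches:
    loss = loss_function(model(inputs), labels) / accumulation_steps
    loss.backward()                                  # gradients are summed into .grad
grads_method_1 = [p.grad.clone() for p in model.parameters()]

# Method 2: sum the losses; all graphs stay in memory until the single backward
model.zero_grad()
total_loss = sum(loss_function(model(inputs), labels) for inputs, labels in batches)
(total_loss / accumulation_steps).backward()
grads_method_2 = [p.grad.clone() for p in model.parameters()]

print(all(torch.allclose(g1, g2) for g1, g2 in zip(grads_method_1, grads_method_2)))  # True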
Upvotes: 4
Reputation: 7
The backward pass, loss.backward(), is the operation that actually computes the gradients. If you only do the forward pass (predictions = model(inputs)), no gradients are computed, and thus no accumulation is possible.
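A quick way to see this (with a hypothetical toy layer, not the model from the question) is to inspect param.grad before and after backward(): it is None until backward() runs, and repeated backward() calls add into the same buffer:

import torch
import torch.nn as nn

layer = nn.Linear(3, 1)
x = torch.randn(5, 3)

out = layer(x).sum()                  # forward pass only
print(layer.weight.grad)              # None: no gradients have been computed yet

out.backward()                        # backward pass populates .grad
first_grad = layer.weight.grad.clone()

layer(x).sum().backward()             # a second backward accumulates into .grad
print(torch.allclose(layer.weight.grad, 2 * first_grad))  # True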
Upvotes: -2