Reputation: 71
So I have an assignment involving Language Modelling and I passed all the unit tests, but my code is too slow to run. I think it's because of the way I compute my loss. The formula we're given is the average, over the batch, of each sequence's mean negative log-likelihood:

$$\mathcal{L} = \frac{1}{B} \sum_{b=1}^{B} \left( -\frac{1}{T_b} \sum_{t=1}^{T_b} \log p(w_{b,t} \mid w_{b,<t}) \right)$$

where $B$ is the batch size and $T_b$ is the unpadded length of sequence $b$.
My naive implementation is the following:
losses_batch_list = []
batch_size = log_probas.size(0)
for b in range(batch_size):
    # the last nonzero mask entry gives the unpadded length of sequence b
    seq_length = max([i for i, e in enumerate(mask[b, :]) if e != 0]) + 1
    loss_batch = 0
    for t in range(seq_length):
        for n in range(self.vocabulary_size):
            if targets[b, t] == n:
                loss_batch += log_probas[b, t, n].detach()
    loss_batch = -loss_batch / seq_length
    losses_batch_list.append(loss_batch)
loss = torch.tensor(np.mean(losses_batch_list))
return loss
But that loop runs forever, since the vocabulary size is approximately the same as GPT-1's (~40,000) and the sequence length is up to 255 (sometimes it is shorter because of padding, hence the mask). Does anyone have any tips on how to vectorize/speed this up? I know it's correct, but I can't report any results with it... Thanks!
Upvotes: 0
Views: 50
Reputation: 188
# Notation:
#   B = batch_size
#   T = sequence_length (padded)
#   N = vocab_size
# a boolean mask is needed for indexing; cast it if it isn't one already
if mask.dtype != torch.bool:
    mask = mask.bool()
mask = mask.view(-1)                    # (B, T) -> (B*T,)
log_probas = log_probas.view(-1, N)     # (B, T, N) -> (B*T, N)
targets = targets.view(-1, 1)           # (B, T) -> (B*T, 1)
# pick each target token's log-probability, dropping padded positions
loss = torch.gather(log_probas[mask], -1, targets[mask])
loss = -loss.mean()                     # negate to get the NLL
Upvotes: 1