Mark Padley

Reputation: 51

Unable to use existing code working with base transformers on 'large' models

My Python code works OK for base transformer models, but when I attempt to use 'large' models, or roberta models, I receive error messages. The most common message is shown below.

Epoch 1 / 40

RuntimeError                              Traceback (most recent call last)
in ()
     12
     13 # train model
---> 14 train_loss, _ = fine_tune()
     15 # WE DON'T CARE ABOUT THE SECOND ITEM THE MODEL OUTPUTS (total_preds)
     16 # We only want the average loss value here, 'avg_loss'

5 frames
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in linear(input, weight, bias)
   1688     if input.dim() == 2 and bias is not None:
   1689         # fused op is marginally faster
-> 1690         ret = torch.addmm(bias, input, weight.t())
   1691     else:
   1692         output = input.matmul(weight.t())

RuntimeError: mat1 dim 1 must match mat2 dim 0

I am guessing there is some kind of mismatch between matrices (tensors) such that an operation cannot occur. If I can better understand the issue, I can better address the necessary changes to my code. Here is the fine-tuning function I am using...

def fine_tune():

    model.train()

    total_loss, total_accuracy = 0, 0

    # empty list to save model predictions
    total_preds = []

    # iterate over batches
    for step, batch in enumerate(train_dataloader):

        # progress update after every 50 batches
        if step % 50 == 0 and not step == 0:
            print('  Batch {:>5,}  of  {:>5,}.'.format(step, len(train_dataloader)))

        # push the batch to the gpu
        batch = [r.to(device) for r in batch]

        sent_id, mask, labels = batch

        # clear previously calculated gradients
        model.zero_grad()

        # get model predictions for the current batch
        preds = model(sent_id, mask)

        # compute the loss between actual and predicted values
        loss = cross_entropy(preds, labels)

        # add on to the total loss
        total_loss = total_loss + loss.item()

        # backward pass to calculate the gradients
        loss.backward()

        # clip the gradients to 1.0; helps prevent the exploding-gradient problem
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # update parameters
        optimizer.step()

        # model predictions are stored on the GPU, so push them to the CPU
        preds = preds.detach().cpu().numpy()
        # length of preds is the same as the batch size

        # append the model predictions
        total_preds.append(preds)

    # compute the average training loss of the epoch
    avg_loss = total_loss / len(train_dataloader)

    # reshape the predictions into (number of samples, number of classes)
    total_preds = np.concatenate(total_preds, axis=0)

    return avg_loss, total_preds
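
For reference, the same error can be reproduced in isolation when a Linear layer receives an input whose feature dimension doesn't match the layer's expected input size. The sketch below is only my guess at the kind of mismatch involved; the sizes 768 and 1024 are illustrative, not something I have confirmed in my own model.

import torch
import torch.nn as nn

# a linear layer sized for 768 input features (the 'base' hidden size)
fc = nn.Linear(768, 512)

# a batch of 8 vectors with 1024 features (the 'large' hidden size)
x = torch.randn(8, 1024)

fc(x)  # raises a RuntimeError from the matrix multiply about mismatched dimensions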

Regards, Mark

Upvotes: 1

Views: 96

Answers (1)

Mark Padley

Reputation: 51

I wrote a print statement to reveal the size of the input coming from the pre-trained model. This revealed the true size, namely 1024, rather than the hard-coded default value of 768 in the program I had modified. An easy fix once I understood the problem. The moral of the story for me is: when a YouTuber (a good one, actually!) says "all transformers have an output dimension of 768", don't necessarily take that as gospel!
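
For anyone who hits the same wall, this is roughly the check I mean; a minimal sketch assuming the Hugging Face transformers library, where 'bert-large-uncased' is just an example checkpoint name:

from transformers import AutoModel

bert = AutoModel.from_pretrained('bert-large-uncased')

# the hidden size lives on the model's config, so there is no need to hard-code it
print(bert.config.hidden_size)   # 1024 for 'large' checkpoints, 768 for 'base'

Building the classification head from bert.config.hidden_size instead of a literal 768 lets the same code run against base, large, and roberta checkpoints.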

Upvotes: 1
