Pablo Cordon

Reputation: 399

TRANSFORMERS: Asking to pad but the tokenizer does not have a padding token

I'm trying to evaluate several transformers models sequentially on the same dataset to check which one performs better.

The list of models is this one:

MODELS = [
      ('xlm-mlm-enfr-1024',    "XLMModel"),
      ('distilbert-base-cased', "DistilBertModel"),
      ('bert-base-uncased',    "BertModel"),
      ('roberta-base',         "RobertaModel"),
      ("cardiffnlp/twitter-roberta-base-sentiment", "RobertaSentTW"),
      ('xlnet-base-cased',     "XLNetModel"),
      #('ctrl',                "CTRLModel"),
      ('transfo-xl-wt103',     "TransfoXLModel"),
      ('bert-base-cased',      "BertModelUncased"),
      ('xlm-roberta-base',     "XLMRobertaModel"),
      ('openai-gpt',           "OpenAIGPTModel"),
      ('gpt2',                 "GPT2Model"),
]

All of them work fine until the 'ctrl' model, which returns this error:

Asking to pad, but the tokenizer does not have a padding token. Please select a token to use as 'pad_token' '(tokenizer.pad_token = tokenizer.eos_token e.g.)' or add a new pad token via 'tokenizer.add_special_tokens({'pad_token': '[PAD]'})'.

The error appears when tokenizing the sentences of my dataset.

The tokenizing code is

import numpy as np
from transformers import AutoTokenizer, TFAutoModel

SEQ_LEN = MAX_LEN  # (50)

for pretrained_weights, model_name in MODELS:

    print("***************** STARTING ", model_name, ", weights ", pretrained_weights, " *********")
    print("loading the tokenizer")
    tokenizer = AutoTokenizer.from_pretrained(pretrained_weights)
    print("creating the pretrained model")
    transformer_model = TFAutoModel.from_pretrained(pretrained_weights)
    print("applying the tokenizer to the dataset")

    ## APPLY THE TOKENIZER ##

    def tokenize(sentence):
        tokens = tokenizer.encode_plus(sentence, max_length=MAX_LEN,
                                       truncation=True, padding='max_length',
                                       add_special_tokens=True, return_attention_mask=True,
                                       return_token_type_ids=False, return_tensors='tf')
        return tokens['input_ids'], tokens['attention_mask']

    # initialize two arrays for input tensors
    Xids = np.zeros((len(df), SEQ_LEN))
    Xmask = np.zeros((len(df), SEQ_LEN))

    for i, sentence in enumerate(df['tweet']):
        Xids[i, :], Xmask[i, :] = tokenize(sentence)
        if i % 10000 == 0:
            print(i)  # print progress

    arr = df['label'].values  # take the label column of df as an array

    labels = np.zeros((arr.size, arr.max() + 1))  # initialize an all-zero one-hot label array
    labels[np.arange(arr.size), arr] = 1  # set ones at the indices given by the labels

I have tried to define the padding token as the error message suggests, but then this error appears:

could not broadcast input array from shape (3,) into shape (50,)

in line

Xids[i, :], Xmask[i, :] = tokenize(sentence)
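For reference, my attempt followed the error message and looked roughly like this:

tokenizer = AutoTokenizer.from_pretrained(pretrained_weights)
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})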

I have also tried this solution, and it doesn't work either.

If you have managed to read until here, thank you.

Any help is appreciated.

Upvotes: 20

Views: 49567

Answers (5)

Jihee Ryu

Reputation: 1

In my code, the following works:

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.unk_token if tokenizer.unk_token else tokenizer.eos_token
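This reuses a token the tokenizer already has (the unknown token if one exists, otherwise the end-of-sequence token), so no new embedding row is added and the model does not need to be resized.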

Upvotes: 0

J Zachary

Reputation: 1

You can also set the special-token ids directly, but note that hard-coded ids like these are specific to one particular model's vocabulary:

tokenizer.eos_token_id = 151646
tokenizer.pad_token_id = 151645
tokenizer.bos_token_id = 151648

Upvotes: -1

qing guo

Reputation: 69

You can also try assigning the eos_token (end-of-sequence token) to the pad_token:

tokenizer.pad_token = tokenizer.eos_token
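For example, with gpt2 (which ships with an eos_token but no pad_token), a minimal sketch:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token  # reuse the end-of-sequence token for padding
tokens = tokenizer('hello world', max_length=10, truncation=True, padding='max_length')
print(tokens['input_ids'])  # padded to length 10 with the eos id

Since the eos token is already in the vocabulary, no embedding resize is needed; the attention mask still marks the padded positions as 0.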

Upvotes: 4

Googr

Reputation: 475

kkgarg's idea was right, but you also need to update your model's token embedding size. So the code will be:

tokenizer = AutoTokenizer.from_pretrained(pretrained_weights)
model = TFAutoModel.from_pretrained(pretrained_weights)
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))
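The resize matters because add_special_tokens gives the new [PAD] token an id one past the original vocabulary size; without resize_token_embeddings, the embedding lookup for that id would be out of range.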

Check this related issue.

Upvotes: 27

kkgarg

Reputation: 1376

You can add the [PAD] token using the add_special_tokens API:

tokenizer = AutoTokenizer.from_pretrained(pretrained_weights)
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
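A quick check that the token was registered (model name just for illustration):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('openai-gpt')
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
print(tokenizer.pad_token, tokenizer.pad_token_id)  # '[PAD]' with an id one past the original vocab size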

Upvotes: 10
