Reputation: 399
I'm trying to evaluate several transformers models sequentially on the same dataset to check which one performs better.
The list of models is this one:
MODELS = [
    ('xlm-mlm-enfr-1024', "XLMModel"),
    ('distilbert-base-cased', "DistilBertModel"),
    ('bert-base-uncased', "BertModel"),
    ('roberta-base', "RobertaModel"),
    ("cardiffnlp/twitter-roberta-base-sentiment", "RobertaSentTW"),
    ('xlnet-base-cased', "XLNetModel"),
    # ('ctrl', "CTRLModel"),
    ('transfo-xl-wt103', "TransfoXLModel"),
    ('bert-base-cased', "BertModelUncased"),
    ('xlm-roberta-base', "XLMRobertaModel"),
    ('openai-gpt', "OpenAIGPTModel"),
    ('gpt2', "GPT2Model")
]
All of them work fine until the 'ctrl' model, which returns this error when tokenizing the sentences of my dataset:
Asking to pad, but the tokenizer does not have a padding token. Please select a token to use as 'pad_token' '(tokenizer.pad_token = tokenizer.eos_token e.g.)' or add a new pad token via 'tokenizer.add_special_tokens({'pad_token': '[PAD]'})'.
The tokenizing code is:
import numpy as np
from transformers import AutoTokenizer, TFAutoModel

SEQ_LEN = MAX_LEN  # (50)

for pretrained_weights, model_name in MODELS:
    print("***************** STARTING ", model_name, ", weights ", pretrained_weights, "********* ")
    print("loading the tokenizer")
    tokenizer = AutoTokenizer.from_pretrained(pretrained_weights)
    print("creating the pretrained model")
    transformer_model = TFAutoModel.from_pretrained(pretrained_weights)
    print("applying the tokenizer to the dataset")

    ## APPLY THE TOKENIZER ##
    def tokenize(sentence):
        tokens = tokenizer.encode_plus(sentence, max_length=MAX_LEN,
                                       truncation=True, padding='max_length',
                                       add_special_tokens=True, return_attention_mask=True,
                                       return_token_type_ids=False, return_tensors='tf')
        return tokens['input_ids'], tokens['attention_mask']

    # initialize two arrays for input tensors
    Xids = np.zeros((len(df), SEQ_LEN))
    Xmask = np.zeros((len(df), SEQ_LEN))

    for i, sentence in enumerate(df['tweet']):
        Xids[i, :], Xmask[i, :] = tokenize(sentence)
        if i % 10000 == 0:
            print(i)  # do this so we can see some progress

    arr = df['label'].values  # take label column in df as array
    labels = np.zeros((arr.size, arr.max() + 1))  # initialize empty (all zero) label array
    labels[np.arange(arr.size), arr] = 1  # add ones in indices where we have a value
I have tried to define the padding token as the error message suggests, but then this error appears:
could not broadcast input array from shape (3,) into shape (50,)
on the line
Xids[i, :], Xmask[i, :] = tokenize(sentence)
I have also tried this solution, and it doesn't work either.
If you have managed to read this far, thank you.
Any help is appreciated.
Upvotes: 20
Views: 49567
Reputation: 1
In my code, the following works:
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.unk_token if tokenizer.unk_token else tokenizer.eos_token
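For example, a minimal sketch assuming the gpt2 checkpoint from the question's MODELS list (which ships without a pad token), placing the fallback before the padded encode_plus call:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')

# gpt2 has no pad token out of the box, so reuse an existing special token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.unk_token if tokenizer.unk_token else tokenizer.eos_token

# padding to a fixed length now succeeds
tokens = tokenizer.encode_plus("example tweet", max_length=50,
                               truncation=True, padding='max_length',
                               return_attention_mask=True, return_tensors='tf')
print(tokens['input_ids'].shape)  # (1, 50)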
Upvotes: 0
Reputation: 1
# hard-coded special-token ids for one specific checkpoint's vocabulary
tokenizer.eos_token_id = 151646
tokenizer.pad_token_id = 151645
tokenizer.bos_token_id = 151648
Upvotes: -1
Reputation: 69
You can also try assigning the eos_token (end-of-sentence token) to the pad_token:
tokenizer.pad_token = tokenizer.eos_token
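Since this reuses a token the model already has an embedding for, the embedding matrix does not need to be resized, unlike when a brand-new [PAD] token is added. It only works for tokenizers that actually define an eos_token, though.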
Upvotes: 4
Reputation: 475
kkgarg's idea was right, but you also need to update your model's token embedding size. So the code becomes:
tokenizer = AutoTokenizer.from_pretrained(pretrained_weights)
model = TFAutoModel.from_pretrained(pretrained_weights)

if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    # the vocabulary grew by one token, so the embedding matrix has to grow with it
    model.resize_token_embeddings(len(tokenizer))
Check this related issue.
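Plugged into the loop from the question (using its MODELS list), a sketch could look like this; only checkpoints that ship without a pad token, such as gpt2, are touched:
from transformers import AutoTokenizer, TFAutoModel

for pretrained_weights, model_name in MODELS:
    tokenizer = AutoTokenizer.from_pretrained(pretrained_weights)
    transformer_model = TFAutoModel.from_pretrained(pretrained_weights)

    # only add a pad token where one is missing; BERT, RoBERTa, etc. already have one
    if tokenizer.pad_token is None:
        tokenizer.add_special_tokens({'pad_token': '[PAD]'})
        transformer_model.resize_token_embeddings(len(tokenizer))

    # ... tokenize the dataset exactly as in the question ...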
Upvotes: 27
Reputation: 1376
You can add the [PAD] token using the add_special_tokens API:
tokenizer = AutoTokenizer.from_pretrained(pretrained_weights)
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
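As the higher-voted answer above points out, if '[PAD]' is genuinely new to the vocabulary you also need model.resize_token_embeddings(len(tokenizer)) afterwards, otherwise the model's embedding layer won't cover the new token id.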
Upvotes: 10