Ramakant Shakya

Reputation: 13

How to add tokens to vocab.txt that get decoded as [UNK] by the BERT tokenizer

I was decoding tokens produced by the BERT tokenizer and it was giving [UNK] for the € symbol. I tried adding a ##€ token to the vocab.txt file, but the change was not reflected: the prediction result was the same as before, still giving [UNK]. Please let me know how to solve this problem. Do I need to fine-tune the model again for the changes to show up in predictions? So far I have avoided fine-tuning again because it takes more than 10 hours. Thanks in advance.

Upvotes: 1

Views: 3264

Answers (1)

cronoik

Reputation: 19520

Use the add_tokens function of the tokenizer to avoid unknown tokens:

from transformers import BertTokenizer

t = BertTokenizer.from_pretrained('bert-base-uncased')
# '🤗' is not in the vocabulary, so it is tokenized as [UNK]
print(t.tokenize("This is an example with an emoji 🤗."))

# Register '🤗' as a new token in the tokenizer's vocabulary
t.add_tokens(['🤗'])
print(t.tokenize("This is an example with an emoji 🤗."))

Output:

['this', 'is', 'an', 'example', 'with', 'an', 'em', '##oj', '##i', '[UNK]', '.']
['this', 'is', 'an', 'example', 'with', 'an', 'em', '##oj', '##i', '🤗', '.']
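The same applies to the € symbol from the question (a small sketch reusing the tokenizer t from above; no manual vocab.txt edits are needed):

t.add_tokens(['€'])
# '€' now comes through as its own token instead of [UNK]
print(t.tokenize("The price is 5€."))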

Please keep in mind that you also need to resize your model's embedding matrix to account for the new token, using resize_token_embeddings:

model.resize_token_embeddings(len(t))
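Put together for the € case, a minimal sketch (assuming bert-base-uncased and a plain BertModel; note that the embedding row for the added token is newly initialized, so the model will usually need some training with the new token before predictions involving it become meaningful):

from transformers import BertModel, BertTokenizer

t = BertTokenizer.from_pretrained('bert-base-uncased')
t.add_tokens(['€'])

model = BertModel.from_pretrained('bert-base-uncased')
# Grow the embedding matrix to match the enlarged vocabulary;
# the new row for '€' starts out untrained
model.resize_token_embeddings(len(t))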

Upvotes: 3
