Reputation: 13
I was decoding tokenized text from the BERT tokenizer, and it returned [UNK] for the € symbol. I tried adding a ##€ token to vocab.txt, but the change was not reflected: the prediction result was the same as before and it still gave [UNK]. Do I need to fine-tune the model again for the change to show up in predictions? So far I have been avoiding fine-tuning again because it takes more than 10 hours. Thanks in advance.
Upvotes: 1
Views: 3264
Reputation: 19520
Use the add_tokens function of the tokenizer to avoid unknown tokens:
from transformers import BertTokenizer

t = BertTokenizer.from_pretrained('bert-base-uncased')

# The emoji is not in the vocabulary, so it is mapped to [UNK]
print(t.tokenize("This is an example with an emoji 🤗."))

# Register the emoji as a new token
t.add_tokens(['🤗'])
print(t.tokenize("This is an example with an emoji 🤗."))
Output:
['this', 'is', 'an', 'example', 'with', 'an', 'em', '##oj', '##i', '[UNK]', '.']
['this', 'is', 'an', 'example', 'with', 'an', 'em', '##oj', '##i', '🤗', '.']
Please keep in mind that you also need to resize the model's embedding matrix so it has an entry for the new token, using resize_token_embeddings:
model.resize_token_embeddings(len(t))
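Applied to your € case, a minimal sketch might look like the following (the model class and the './model_with_euro' save path are assumptions here; substitute whatever checkpoint and model class you actually use). Note that the embedding for the new token is randomly initialized, so predictions involving it will only become meaningful after some further fine-tuning:

from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')  # assumed model class

# Add € as a whole token instead of editing vocab.txt by hand
tokenizer.add_tokens(['€'])

# Grow the embedding matrix so the new token id has a (randomly initialized) vector
model.resize_token_embeddings(len(tokenizer))

# Save both so the change survives reloading
tokenizer.save_pretrained('./model_with_euro')
model.save_pretrained('./model_with_euro')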
Upvotes: 3