Reputation: 79
I'm doing a sequence labeling task with BERT. In order to align the word pieces with labels, I need some marker to identify them so I can get a single embedding for each word by either summing or averaging.
For example, I want the word New~york
tokenized into New ##~ ##york
. Looking at some old examples on the internet, that is what you used to get from BertTokenizer, but clearly not anymore (according to their documentation).
So when I run:
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
batch_sentences = ["hello, i'm testing this efauenufefu"]
inputs = tokenizer(batch_sentences, return_tensors="pt")
decoded = tokenizer.decode(inputs["input_ids"][0])
print(decoded)
and I get:
[CLS] hello, i'm testing this efauenufefu [SEP]
But the encoding clearly suggests otherwise: the nonsense at the end was indeed broken up into pieces...
In [4]: inputs
Out[4]:
{'input_ids': tensor([[  101, 19082,   117,   178,   112,   182,  5193,  1142,   174,  8057,
         23404, 16205, 11470,  1358,   102]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
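For what it's worth, converting the ids back to tokens directly (with the tokenizer's standard convert_ids_to_tokens method) shows that the pieces do carry the ## prefix; it is decode that merges them back into words:
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()))
# ['[CLS]', 'hello', ',', ..., '[SEP]'], with the nonsense word split
# into several pieces prefixed by '##'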
I also tried to use BertTokenizerFast
, which, unlike BertTokenizer
, allows you to specify a wordpiece prefix:
tokenizer2 = BertTokenizerFast("bert-base-cased-vocab.txt", wordpieces_prefix = "##")
batch_sentences = ["hello, i'm testing this efauenufefu"]
inputs = tokenizer2(batch_sentences, return_tensors="pt")
decoded = tokenizer2.decode(inputs["input_ids"][0])
print(decoded)
Yet the decoder gave me exactly the same output...
[CLS] hello, i'm testing this efauenufefu [SEP]
So, is there a way to use the pretrained Hugging Face tokenizer with the wordpiece prefix visible, or must I train a custom tokenizer myself?
Upvotes: 2
Views: 1542
Reputation: 19520
Maybe you are looking for tokenize:
from transformers import BertTokenizerFast
t = BertTokenizerFast.from_pretrained('bert-base-uncased')
t.tokenize("hello, i'm testing this efauenufefu")
Output:
['hello',
',',
'i',
"'",
'm',
'testing',
'this',
'e',
'##fa',
'##uen',
'##uf',
'##ef',
'##u']
You can also get a mapping of each token to its corresponding word, and other things:
o = t("hello, i'm testing this efauenufefu", add_special_tokens=False, return_attention_mask=False, return_token_type_ids=False)
o.words()
Output:
[0, 1, 2, 3, 4, 5, 6, 7, 7, 7, 7, 7, 7]
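With that mapping you can pool the wordpiece embeddings back into one vector per word, which is what you wanted in the first place. A minimal sketch, assuming mean pooling and the word_ids() accessor (the newer name for words()); summing works the same way:
import torch
from transformers import BertTokenizerFast, BertModel

t = BertTokenizerFast.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

enc = t("hello, i'm testing this efauenufefu",
        add_special_tokens=False, return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc).last_hidden_state[0]  # (num_tokens, hidden_size)

word_ids = enc.word_ids()  # [0, 1, 2, 3, 4, 5, 6, 7, 7, 7, 7, 7, 7]
num_words = max(word_ids) + 1
# average the embeddings of all pieces belonging to the same word
word_vecs = torch.stack([
    hidden[[i for i, w in enumerate(word_ids) if w == wid]].mean(dim=0)
    for wid in range(num_words)
])
print(word_vecs.shape)  # torch.Size([8, 768]), one vector per word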
Upvotes: 3