Reputation: 1687
I'm trying to locate all the [UNK] tokens that the BERT tokenizer produces on my text. Once I have the position of an [UNK] token, I need to identify which word it belongs to. For that, I tried to get the position of the word using the word_ids() or token_to_word() methods (the result is the same, I think), which give me the word index for that token.
The problem is that, for a large text, there are many ways to split the text into words, and the ones I tried don't match the positions I get from token_to_word(). How can I split my text the same way the BERT tokenizer does?
I saw that BERT uses WordPiece to tokenize into sub-words, but I found nothing for complete words.
I'm at this point:
tokenized_text = tokenizer.tokenize(texto)  # tokens
encoding_text = tokenizer(texto)  # this is a BatchEncoding, like an instance of the tokenizer
tpos = [i for i, element in enumerate(tokenized_text) if element == "[UNK]"]  # positions in the token list
word_list = texto.split(" ")
for x in tpos:
    wpos = encoding_text.token_to_word(x)  # position in the word list
    print("The word:", word_list[wpos], "contains an unknown token:", tokenizer.tokenize(word_list[wpos]))
but it fails because the index "wpos" doesn't line up with my word_list.
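For illustration, a quick way to see the mismatch is to compare the tokenizer's own word segmentation with a plain whitespace split. This is only a sketch; it assumes a fast BERT tokenizer loaded as bert-base-uncased and a toy string, both placeholders for your own setup:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed model, use your own
texto = "Hello, world!"

encoding_text = tokenizer(texto)
print(texto.split(" "))          # ['Hello,', 'world!'] -> 2 "words"
print(encoding_text.word_ids())  # e.g. [None, 0, 1, 2, 3, None] -> punctuation counts as its own word

The tokenizer's pre-tokenization splits on punctuation as well as whitespace, so the word indices it returns don't correspond to the elements of texto.split(" ").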
Upvotes: 0
Views: 1953
Reputation: 1687
The problem is solved with the token_to_chars()
method, as @cronoik proposed in the comments. It gives me the exact character position of any token, even [UNK], and it is universal: it doesn't depend on how the text is split into words, unlike the word indices I was using before.
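For reference, here is a minimal sketch of that approach. It assumes a fast BERT tokenizer loaded as bert-base-uncased and a sample string texto; both are placeholders for your own setup:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed model, use your own
texto = "some text that may contain out-of-vocabulary characters"

encoding_text = tokenizer(texto)

# Find [UNK] positions directly in the encoded ids; these indices already
# include special tokens such as [CLS], so they match the encoding.
unk_positions = [i for i, tid in enumerate(encoding_text.input_ids)
                 if tid == tokenizer.unk_token_id]

for i in unk_positions:
    span = encoding_text.token_to_chars(i)  # CharSpan(start, end) into the original string
    print("[UNK] at token", i, "covers:", texto[span.start:span.end])

Because the span refers to character offsets in the original string, you can recover the surrounding word with a simple slice, regardless of how the text would be split by whitespace.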
Upvotes: 1