Andrea NR

Reputation: 1687

Split a sentence into words just as the BERT tokenizer would?

I'm trying to locate all the [UNK] tokens that the BERT tokenizer produces on my text. Once I have the position of an [UNK] token, I need to identify which word it belongs to. For that, I tried to get the position of the word using the word_ids() or token_to_word() methods (the result is the same, I think), which give me the word index for that token.

The problem is that, for a large text, there are many ways to split it into words, and the ways I tried don't match the positions I get from the token_to_word() method. How can I split my text the same way the BERT tokenizer does?

I saw that BERT uses WordPiece to tokenize into sub-words, but I found nothing about how it splits the text into complete words.

I'm at this point:

  from transformers import AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # any BERT checkpoint; texto holds the input text

  tokenized_text = tokenizer.tokenize(texto)  # list of tokens (no special tokens)
  encoding_text = tokenizer(texto)  # this is a BatchEncoding, like an instance of the tokenizer
  tpos = [i for i, element in enumerate(tokenized_text) if element == "[UNK]"]  # positions in the token list

  word_list = texto.split(" ")
  for x in tpos:
    wpos = encoding_text.token_to_word(x)  # position in the word list
    print("The word: ", word_list[wpos], "    contains an unknown token: ", tokenizer.tokenize(word_list[wpos]))

but it fails because the index wpos doesn't line up with my word_list.

Upvotes: 0

Views: 1953

Answers (1)

Andrea NR

Reputation: 1687

The problem is solved with the token_to_chars() method, as @cronoik proposed in the comments. It gives me the exact character position of any token, even [UNK], and it is universal, unlike the word positions I used before, which depend on how the text is split.
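
A minimal sketch of that approach, assuming a fast tokenizer (here bert-base-cased, any BERT checkpoint works) and the same texto string as in the question:

  from transformers import AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # assumed checkpoint; texto is the input text
  encoding_text = tokenizer(texto)

  for i, token in enumerate(encoding_text.tokens()):
      if token == "[UNK]":
          span = encoding_text.token_to_chars(i)  # CharSpan with .start and .end into the original string
          print("Unknown token covers the characters:", repr(texto[span.start:span.end]))

The character spans come straight from the fast tokenizer's offset mapping, so they don't depend on how you split the text into words.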

Upvotes: 1
