JayJay

Reputation: 203

How to untokenize BERT tokens?

I have a sentence and I need to return the text corresponding to N BERT tokens to the left and right of a specific word.

from transformers import BertTokenizer
tz = BertTokenizer.from_pretrained("bert-base-cased")
sentence = "The Natural Science Museum of Madrid shows the RECONSTRUCTION of a dinosaur"

tokens = tz.tokenize(sentence)
print(tokens)

>>['The', 'Natural', 'Science', 'Museum', 'of', 'Madrid', 'shows', 'the', 'R', '##EC', '##ON', '##ST', '##R', '##UC', '##TI', '##ON', 'of', 'a', 'dinosaur']

What I want is to get the text corresponding to the 4 tokens to the left and to the right of the token Madrid. So I want the tokens: ['Natural', 'Science', 'Museum', 'of', 'Madrid', 'shows', 'the', 'R', '##EC'] and then transform them back into the original text. In this case that would be 'Natural Science Museum of Madrid shows the REC'.
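I can get the token window itself easily enough (this assumes the target word maps to a single token):

idx = tokens.index("Madrid")
window = tokens[max(idx - 4, 0):idx + 5]  # 4 tokens on each side
print(window)

>>['Natural', 'Science', 'Museum', 'of', 'Madrid', 'shows', 'the', 'R', '##EC']

What I don't know is how to turn these tokens back into text.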

Is there a way to do this?

Upvotes: 7

Views: 10260

Answers (2)

cronoik

Reputation: 19520

In addition to the information provided by Jindřich about the information loss, I want to add that Hugging Face provides a built-in method to convert tokens to a string (the lost information remains lost!). The method is called convert_tokens_to_string:

tz.convert_tokens_to_string(tokens[1:10])

Output:

'Natural Science Museum of Madrid shows the REC'
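Combined with the windowing from the question, the whole lookup becomes (again assuming the target word maps to a single token):

idx = tokens.index("Madrid")
tz.convert_tokens_to_string(tokens[max(idx - 4, 0):idx + 5])

Output:

'Natural Science Museum of Madrid shows the REC'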

Upvotes: 13

Jindřich

Reputation: 11213

BERT uses WordPiece tokenization, which is unfortunately not lossless, i.e., you are never guaranteed to get the same sentence back after detokenization. This is a big difference from RoBERTa, whose byte-level BPE tokenizer is fully reversible.
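You can see the loss directly: the tokenizer collapses whitespace and splits off punctuation, so a round trip does not recover the input. A minimal sketch with the tokenizer from the question:

tz.convert_tokens_to_string(tz.tokenize("A sentence  with   extra spaces."))

Output:

'A sentence with extra spaces .'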

You can get the so-called pre-tokenized text by merging tokens that start with ##:

pretok_sent = ""
for tok in tokens:
     if tok.startswith("##"):
         pretok_sent += tok[2:]
     else:
         pretok_sent += " " + tok
pretok_sent = pretok_sent[1:]
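For the tokens from the question, this reconstructs the original sentence:

print(pretok_sent)

>>'The Natural Science Museum of Madrid shows the RECONSTRUCTION of a dinosaur'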

Note, however, that if the sentence contains punctuation, the punctuation will remain separated from the other tokens; that is what pre-tokenization means. Such a sentence can look like this:

'This is a sentence ( with brackets ) .'

Going from the pre-tokenized text to a standard sentence is the lossy step (you can never know if, and how many, extra spaces were in the original sentence). You can get a standard sentence by applying detokenization rules, such as those in sacremoses:

import sacremoses

detok = sacremoses.MosesDetokenizer('en')
pretok_sent = 'This is a sentence ( with brackets ) .'
detok.detokenize(pretok_sent.split(" "))

This results in:

'This is a sentence (with brackets).'

Upvotes: 5
