Reputation: 203
I have a sentence and I need to return the text corresponding to N BERT tokens to the left and right of a specific word.
from transformers import BertTokenizer
tz = BertTokenizer.from_pretrained("bert-base-cased")
sentence = "The Natural Science Museum of Madrid shows the RECONSTRUCTION of a dinosaur"
tokens = tz.tokenize(sentence)
print(tokens)
>>['The', 'Natural', 'Science', 'Museum', 'of', 'Madrid', 'shows', 'the', 'R', '##EC', '##ON', '##ST', '##R', '##UC', '##TI', '##ON', 'of', 'a', 'dinosaur']
What I want is to get the text corresponding to 4 tokens to the left and to the right of the token Madrid. So I want the tokens: ['Natural', 'Science', 'Museum', 'of', 'Madrid', 'shows', 'the', 'R', '##EC'] and then transform them back into the original text. In this case it would be 'Natural Science Museum of Madrid shows the REC'.
Is there a way to do this?
Upvotes: 7
Views: 10260
Reputation: 19520
In addition to the information provided by Jindrich about the information loss, I want to add that Hugging Face provides a built-in method to convert tokens to a string (the lost information remains lost!). The method is called convert_tokens_to_string:
tz.convert_tokens_to_string(tokens[1:10])
Output:
'Natural Science Museum of Madrid shows the REC'
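If you don't want to hardcode the slice indices, here is a minimal sketch for computing the window around the target word (it assumes, as in your example, that the word Madrid is a single token; note that list.index only finds the first occurrence):
n = 4
idx = tokens.index("Madrid")
window = tokens[max(0, idx - n): idx + n + 1]  # n tokens on each side
tz.convert_tokens_to_string(window)
This returns the same 'Natural Science Museum of Madrid shows the REC' as above.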
Upvotes: 13
Reputation: 11213
BERT uses WordPiece tokenization, which is unfortunately not lossless, i.e., you are never guaranteed to get the same sentence back after detokenization. This is a big difference from RoBERTa, whose byte-level BPE tokenization is fully reversible.
You can get the so-called pre-tokenized text by merging the tokens that start with ## back into the preceding token:
pretok_sent = ""
for tok in tokens:
    if tok.startswith("##"):
        pretok_sent += tok[2:]
    else:
        pretok_sent += " " + tok
pretok_sent = pretok_sent[1:]
This snippet reconstructs the sentence in your example, but note that if the sentence contains punctuation, the punctuation remains separated from the other tokens, because that is what pre-tokenization does. Such a sentence can look like this:
'This is a sentence ( with brackets ) .'
Going from the pre-tokenized text to a standard sentence is the lossy step (you can never know if and how many extra spaces there were in the original sentence). You can get a standard sentence by applying detokenization rules, such as those in sacremoses.
import sacremoses

detok = sacremoses.MosesDetokenizer(lang="en")
sent = "This is a sentence ( with brackets ) ."
detok.detokenize(sent.split(" "))
This results in:
'This is a sentence (with brackets).'
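Applied to the question's example, the two steps combine like this (a sketch; the window slicing assumes Madrid is a single token, and detok is the MosesDetokenizer from above):
idx = tokens.index("Madrid")
window = tokens[max(0, idx - 4): idx + 5]  # 4 tokens on each side

# Merge the ## word pieces back into whole words
pretok = ""
for tok in window:
    if tok.startswith("##"):
        pretok += tok[2:]
    else:
        pretok += " " + tok

detok.detokenize(pretok[1:].split(" "))
Since this window contains no punctuation, the detokenizer simply joins the words, giving 'Natural Science Museum of Madrid shows the REC'.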
Upvotes: 5