Anurag Sharma

Reputation: 5039

Problem with batch_encode_plus method of tokenizer

I am encountering a strange issue with the tokenizer's batch_encode_plus method. I recently switched from transformers version 3.3.0 to 4.5.1. (I am creating my databunch for NER.)

I have two sentences that I need to encode, and in my case the sentences are already tokenized. Because the two sentences differ in length, I need to pad the shorter one with [PAD] so that the batch has a uniform length.

Here is what I did with version 3.3.0 of transformers:

from transformers import AutoTokenizer

pretrained_model_name = 'distilbert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name, add_prefix_space=True)

sentences = ["He is an uninvited guest.", "The host of the party didn't sent him the invite."]

# here we have the complete sentences
encodings = tokenizer.batch_encode_plus(sentences, max_length=20, padding=True)
batch_token_ids, attention_masks = encodings["input_ids"], encodings["attention_mask"]
print(batch_token_ids[0])
print(tokenizer.convert_ids_to_tokens(batch_token_ids[0]))

# And the output
# [101, 1124, 1110, 1126, 8362, 1394, 5086, 1906, 3648, 119, 102, 0, 0, 0, 0]
# ['[CLS]', 'He', 'is', 'an', 'un', '##in', '##vi', '##ted', 'guest', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']

# here we have the already tokenized sentences
encodings = tokenizer.batch_encode_plus(batch_token_ids, max_length=20, padding=True, truncation=True, is_split_into_words=True, add_special_tokens=False, return_tensors="pt")

batch_token_ids, attention_masks = encodings["input_ids"], encodings["attention_mask"]
print(batch_token_ids[0])
print(tokenizer.convert_ids_to_tokens(batch_token_ids[0])) 

# And the output 
tensor([ 101, 1124, 1110, 1126, 8362, 1394, 5086, 1906, 3648,  119,  102, 0, 0, 0, 0])
['[CLS]', 'He', 'is', 'an', 'un', '##in', '##vi', '##ted', 'guest', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']

But if I try to reproduce the same behavior in transformers version 4.5.1, I get a different output:

from transformers import AutoTokenizer
    
pretrained_model_name = 'distilbert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name, add_prefix_space=True)

sentences = ["He is an uninvited guest.", "The host of the party didn't sent him the invite."]

# here we have the complete sentences
encodings = tokenizer.batch_encode_plus(sentences, max_length=20, padding=True)
batch_token_ids, attention_masks = encodings["input_ids"], encodings["attention_mask"]
print(batch_token_ids[0])
print(tokenizer.convert_ids_to_tokens(batch_token_ids[0]))

# And the output
#[101, 1124, 1110, 1126, 8362, 1394, 5086, 1906, 3648, 119, 102, 0, 0, 0, 0]
#['[CLS]', 'He', 'is', 'an', 'un', '##in', '##vi', '##ted', 'guest', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']

# here we have the already tokenized sentences. Note that we cannot pass batch_token_ids
# to batch_encode_plus in the newer version, so we need to convert them to tokens first
tokens1 = tokenizer.tokenize(sentences[0], add_special_tokens=True)
tokens2 = tokenizer.tokenize(sentences[1], add_special_tokens=True)

encodings = tokenizer.batch_encode_plus([tokens1, tokens2], max_length=20, padding=True, truncation=True, is_split_into_words=True, add_special_tokens=False, return_tensors="pt")

batch_token_ids, attention_masks = encodings["input_ids"], encodings["attention_mask"]
print(batch_token_ids[0])
print(tokenizer.convert_ids_to_tokens(batch_token_ids[0]))

# And the output (not the desired one)
tensor([  101,  1124,  1110,  1126,  8362,   108,   108,  1107,   108,   108,
          191,  1182,   108,   108, 21359,  1181,  3648,   119,   102])
['[CLS]', 'He', 'is', 'an', 'un', '#', '#', 'in', '#', '#', 'v', '##i', '#', '#', 'te', '##d', 'guest', '.', '[SEP]']

Not sure how to handle this, or what I am doing wrong here.

Upvotes: 0

Views: 9007

Answers (2)

kkgarg

Reputation: 1376

You need a non-fast tokenizer to pass lists of integer token IDs.

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name, add_prefix_space=True, use_fast=False)

The use_fast flag is enabled by default in later versions.

From the Hugging Face documentation:

batch_encode_plus(batch_text_or_text_pairs: ...)

batch_text_or_text_pairs (List[str], List[Tuple[str, str]], List[List[str]], List[Tuple[List[str], List[str]]], and for not-fast tokenizers, also List[List[int]], List[Tuple[List[int], List[int]]])
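If the only goal is to bring ID lists that are already encoded to a uniform length, one option is to pad them by hand and skip the second encoding pass. The following is a minimal sketch in plain Python, not the library's own padding routine. It assumes that [PAD] has id 0 (as in the outputs shown in the question); the helper name pad_id_batch and the ids of the second sentence are hypothetical and used only for illustration.

```python
from typing import List, Tuple

PAD_ID = 0  # assumption: [PAD] maps to id 0, as in the question's printed output


def pad_id_batch(
    batch: List[List[int]], max_length: int, pad_id: int = PAD_ID
) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad every ID list to the longest one in the (non-empty) batch.

    Sequences longer than max_length are cut by plain slicing, which would
    also drop a trailing [SEP]; handle that separately if it matters.
    """
    truncated = [ids[:max_length] for ids in batch]
    target = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (target - len(ids)) for ids in truncated]
    attention_mask = [
        [1] * len(ids) + [0] * (target - len(ids)) for ids in truncated
    ]
    return input_ids, attention_mask


# IDs of the first sentence, copied from the question's output (without padding)
first = [101, 1124, 1110, 1126, 8362, 1394, 5086, 1906, 3648, 119, 102]
# Hypothetical placeholder IDs standing in for the 15-token second sentence
second = [101] + [1] * 13 + [102]

input_ids, attention_mask = pad_id_batch([first, second], max_length=20)
print(input_ids[0])
print(attention_mask[0])
```

The resulting lists can be wrapped in tensors afterwards if the model needs them.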

Upvotes: 5

A.M. Ducu

Reputation: 900

I am writing here because I am unable to comment on the question itself. I suggest looking at the output of each tokenization (tokens1 and tokens2) and comparing it to batch_token_ids. It's odd that the output does not contain tokens from the second sentence. Maybe there is an issue there.
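One likely explanation for the stray '#' tokens, based on the printed output: with is_split_into_words=True each list item is treated as a whole word and tokenized again, so a WordPiece continuation such as '##in' is first split on punctuation into '#', '#', 'in'. The sketch below imitates only that punctuation split in plain Python. It is a simplified stand-in, not the tokenizer's actual code, and split_on_punctuation is a hypothetical helper name.

```python
import string
from typing import List


def split_on_punctuation(word: str) -> List[str]:
    """Split a word so that every ASCII punctuation character stands alone."""
    pieces: List[str] = []
    current = ""
    for ch in word:
        if ch in string.punctuation:
            if current:
                pieces.append(current)
                current = ""
            pieces.append(ch)  # each punctuation mark becomes its own piece
        else:
            current += ch
    if current:
        pieces.append(current)
    return pieces


# WordPiece tokens of the first sentence, taken from the question
for token in ["He", "un", "##in", "##vi", "##ted"]:
    print(token, "->", split_on_punctuation(token))
```

The remaining pieces would then go through WordPiece again, which is consistent with 'vi' showing up as 'v', '##i' and 'ted' as 'te', '##d' in the question's output.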

Upvotes: 1
