andy

Reputation: 2161

What's the difference between tokenizer.encode and tokenizer.encode_plus in Hugging Face?

Here is an example of sequence classification, using a model to determine whether two sequences are paraphrases of each other. The two examples give two different results. Can you help me understand why tokenizer.encode and tokenizer.encode_plus give different results?

Example 1 (with .encode_plus()):

paraphrase = tokenizer.encode_plus(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer.encode_plus(sequence_0, sequence_1, return_tensors="pt")

paraphrase_classification_logits = model(**paraphrase)[0]
not_paraphrase_classification_logits = model(**not_paraphrase)[0]

Example 2 (with .encode()):

paraphrase = tokenizer.encode(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer.encode(sequence_0, sequence_1, return_tensors="pt")

paraphrase_classification_logits = model(paraphrase)[0]
not_paraphrase_classification_logits = model(not_paraphrase)[0]

Upvotes: 45

Views: 67879

Answers (2)

Oscar Rangel

Reputation: 1056

The tokenizer.encode_plus function combines multiple steps for us:

1. Split the sentence into tokens.
2. Add the special [CLS] and [SEP] tokens.
3. Map the tokens to their IDs.
4. Pad or truncate all sentences to the same length.
5. Create the attention masks which explicitly differentiate real tokens from [PAD] tokens.
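The five steps above can be sketched in plain Python. This is a toy illustration, not the library's implementation: the whitespace tokenizer, the tiny `VOCAB` dictionary, and the helper name `toy_encode_plus` are all assumptions made up for this example. Real BERT tokenizers use WordPiece and a much larger vocabulary.

```python
# Hypothetical toy vocabulary; the IDs are made up for illustration.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "the": 4, "cat": 5, "sat": 6}


def toy_encode_plus(sentence, max_length):
    """Toy stand-in for the steps encode_plus performs on one sentence."""
    # Step 1: split the sentence into tokens (whitespace split stands in for WordPiece).
    tokens = sentence.lower().split()
    # Step 4 (truncate): leave room for the two special tokens.
    tokens = tokens[:max_length - 2]
    # Step 2: add the special [CLS] and [SEP] tokens.
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # Step 3: map tokens to IDs, falling back to [UNK] for unknown words.
    input_ids = [VOCAB.get(token, VOCAB["[UNK]"]) for token in tokens]
    # Step 5: real tokens get mask value 1 ...
    attention_mask = [1] * len(input_ids)
    # Step 4 (pad): ... and [PAD] tokens get mask value 0.
    num_pad = max_length - len(input_ids)
    input_ids += [VOCAB["[PAD]"]] * num_pad
    attention_mask += [0] * num_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask}


encoded = toy_encode_plus("The cat sat", max_length=8)
print(encoded["input_ids"])       # [2, 4, 5, 6, 3, 0, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

The attention mask is built before padding so that every position added by the padding step is marked 0.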

Documentation is here

# Tokenize all of the sentences and map the tokens to their word IDs.
input_ids = []
attention_masks = []

# For every sentence...
for sent in sentences:
    # `encode_plus` will:
    #   (1) Tokenize the sentence.
    #   (2) Prepend the `[CLS]` token to the start.
    #   (3) Append the `[SEP]` token to the end.
    #   (4) Map tokens to their IDs.
    #   (5) Pad or truncate the sentence to `max_length`
    #   (6) Create attention masks for [PAD] tokens.
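    encoded_dict = tokenizer.encode_plus(
                        sent,                      # Sentence to encode.
                        add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                        max_length = 64,           # Pad & truncate all sentences.
                        padding = 'max_length',    # Replaces the deprecated `pad_to_max_length = True`.
                        truncation = True,         # Cut sentences longer than `max_length`.
                        return_attention_mask = True,   # Construct attn. masks.
                        return_tensors = 'pt',     # Return pytorch tensors.
                   )

    # Collect the encoded sentence and its attention mask.
    input_ids.append(encoded_dict['input_ids'])
    attention_masks.append(encoded_dict['attention_mask'])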

Upvotes: 12

dennlinger

Reputation: 11488

The main difference stems from the additional information that encode_plus provides. If you read the documentation on the respective functions, there is a slight difference for encode():

Converts a string in a sequence of ids (integer), using the tokenizer and vocabulary. Same as doing self.convert_tokens_to_ids(self.tokenize(text)).

and the description of encode_plus():

Returns a dictionary containing the encoded sequence or sequence pair and additional information: the mask for sequence classification and the overflowing elements if a max_length is specified.

Depending on your specified model and input sentence, the difference lies in the additionally encoded information, specifically the input mask. Since you are feeding in two sentences at a time, BERT (and likely other model variants) expects some form of masking that allows the model to discern between the two sequences (for BERT, this is the token_type_ids entry in the dictionary returned by encode_plus), see here. Since encode_plus provides this information but encode doesn't, you get different results.
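To make the missing information concrete, here is a minimal sketch in plain Python of what each call carries for a sentence pair. It is a toy model of the behavior, not the library's code: the function names `toy_encode_pair` and `toy_encode_plus_pair` and the special-token IDs are hypothetical, and the segment layout shown is the one BERT uses.

```python
CLS_ID, SEP_ID = 2, 3  # Hypothetical IDs for [CLS] and [SEP].


def toy_encode_pair(ids_a, ids_b):
    """Like encode() on a pair: returns only the joined token IDs."""
    return [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]


def toy_encode_plus_pair(ids_a, ids_b):
    """Like encode_plus() on a pair: the same IDs plus segment and attention info."""
    input_ids = toy_encode_pair(ids_a, ids_b)
    # Segment 0 covers [CLS] + sequence A + [SEP]; segment 1 covers sequence B + [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    attention_mask = [1] * len(input_ids)  # No padding here, so every token is real.
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


ids_only = toy_encode_pair([4, 5], [6, 7, 8])
full = toy_encode_plus_pair([4, 5], [6, 7, 8])

print(ids_only)                # [2, 4, 5, 3, 6, 7, 8, 3]
print(full["token_type_ids"])  # [0, 0, 0, 0, 1, 1, 1, 1]
```

The token IDs are identical in both cases; only the dictionary version tells the model where the second sequence starts. When a BERT model receives the IDs alone, it has to fall back to default segment IDs, which would explain why the logits in the two examples differ.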

Upvotes: 49
