Reputation: 153
Using HuggingFace's pipeline tool, I was surprised to find that there was a significant difference in output when using the fast vs slow tokenizer.
Specifically, when I run the fill-mask pipeline, the probabilities assigned to the words that would fill in the mask are not the same for the fast and slow tokenizer. Moreover, while the predictions of the fast tokenizer remain constant regardless of the number and length of sentences input, the same is not true for the slow tokenizer.
Here's a minimal example:
from transformers import pipeline
slow = pipeline('fill-mask', model='bert-base-cased',
                tokenizer=('bert-base-cased', {"use_fast": False}))
fast = pipeline('fill-mask', model='bert-base-cased',
                tokenizer=('bert-base-cased', {"use_fast": True}))
s1 = "This is a short and sweet [MASK]." # "example"
s2 = "This is [MASK]." # "shorter"
slow([s1, s2])
fast([s1, s2])
slow([s2])
fast([s2])
Each pipeline call yields the top-5 tokens that could fill in for [MASK], along with their probabilities. I've omitted the actual outputs for brevity, but the probabilities assigned to the candidate words for the [MASK] in s2 are not the same across all four calls: the last three calls give identical probabilities, while the first yields different ones. The differences are large enough that the top-5 tokens are not even consistent across the two groups.
The cause behind this, as far as I can tell, is that the fast and slow tokenizers return different outputs. The fast tokenizer standardizes sequence length to 512 by padding with 0s, and then creates an attention mask that blocks out the padding. In contrast, the slow tokenizer only pads to the length of the longest sequence, and does not create such an attention mask. Instead, it sets the token type id of the padding to 1 (rather than 0, which is the type of the non-padding tokens). By my understanding of HuggingFace's implementation (found here), these are not equivalent.
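To see the difference directly, here's a rough sketch of how you might inspect what each tokenizer produces for the two sentences (this uses the newer callable-tokenizer API rather than the 2.x-era tuple syntax above, so the exact arguments may differ on older versions):

from transformers import AutoTokenizer

s1 = "This is a short and sweet [MASK]."
s2 = "This is [MASK]."

slow_tok = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)
fast_tok = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)

for name, tok in [("slow", slow_tok), ("fast", fast_tok)]:
    # Pad the batch and look at what the model would actually receive.
    enc = tok([s1, s2], padding=True)
    print(name)
    print("  input_ids lengths:", [len(ids) for ids in enc["input_ids"]])
    print("  attention_mask:   ", enc.get("attention_mask"))
    print("  token_type_ids:   ", enc.get("token_type_ids"))

On the affected versions, the slow path appears to lack an attention mask over its padding, which would explain why the padded positions still influence the predicted probabilities.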
Does anyone know if this is intentional?
Upvotes: 3
Views: 4778
Reputation: 153
It appears that this bug was fixed sometime between transformers-2.5.0 and transformers-2.8.0, or tokenizers-0.5.0 and tokenizers-0.5.2. Upgrading my install of transformers (and tokenizers) solved this.
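For reference, after upgrading (e.g. with pip install --upgrade transformers tokenizers) you can confirm which versions are actually installed with:

import transformers
import tokenizers

print(transformers.__version__, tokenizers.__version__)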
(Let me know if this question/answer is trivial and should be deleted.)
Upvotes: 8