Reputation: 1549
I'm currently working through a TensorFlow Transformer tutorial for sequence-to-sequence translation. At the beginning of the tutorial, the class tfds.features.text.SubwordTextEncoder is used. This class can be used to convert a string to a list of integers, each representing a subword.
After using the class SubwordTextEncoder to train an English tokenizer as follows:
tokenizer_en = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    (en.numpy() for pt, en in train_examples), target_vocab_size=2**13)
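For context, train_examples in that tutorial comes from the TED Talks Portuguese-to-English dataset loaded with tfds; a minimal setup sketch, assuming the same dataset name as in the tutorial:

import tensorflow_datasets as tfds

# Load the Portuguese-to-English translation dataset as (pt, en) pairs.
examples, metadata = tfds.load('ted_hrlr_translate/pt_to_en',
                               with_info=True, as_supervised=True)
train_examples, val_examples = examples['train'], examples['validation']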
the tutorial shows how this tokenizer can now be used to convert strings to lists of integers. This code snippet
sample_string = 'Transformer is awesome.'
tokenized_string = tokenizer_en.encode(sample_string)
print ('Tokenized string is {}'.format(tokenized_string))
gives the following result:
[7915, 1248, 7946, 7194, 13, 2799]
where the integer-to-subword mapping can be shown as follows:
for ts in tokenized_string:
    print('{} ----> {}'.format(ts, tokenizer_en.decode([ts])))
returns
7915 ----> T
1248 ----> ran
7946 ----> s
7194 ----> former
13 ----> is
2799 ----> awesome
This all makes sense to me. The tokenizer recognises the words 'is' and 'awesome' from its training set and assigns the corresponding integers. The word 'Transformer', which was not in its training set, is split up into parts, as mentioned in the documentation.
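The tutorial also shows that this splitting is lossless: decoding the full list of ids reconstructs the original string. A small check, assuming the same tokenizer_en, sample_string and tokenized_string as above:

# Decoding the complete id list recovers the original text.
original_string = tokenizer_en.decode(tokenized_string)
print('The original string: {}'.format(original_string))

assert original_string == sample_string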
After some experimenting with the tokenizer however, I got confused. Please consider the following code snippets
sample_string2 = 'the best there is'
tokenized_string2 = tokenizer_en.encode(sample_string2)
print(tokenized_string2)
which returns
[3, 332, 64, 156]
and
for ts in tokenized_string2:
    print('{} ----> {}'.format(ts, tokenizer_en.decode([ts])))
which returns
3 ----> the
332 ----> best
64 ----> there
156 ----> is
Question: Why does the tokenizer return different integers for the same word if it appears in a different part of the sentence? The word 'is' maps to 156 in the second example, whereas in the first example it is mapped to 13, using the same tokenizer.
Upvotes: 2
Views: 1849
Reputation:
I added len(tokenizer_en.decode([ts])) to the print statement to show the length of each decoded subword, and tried the example below.
Example:
sample_string2 = 'is is is is is is'
tokenized_string2 = tokenizer_en.encode(sample_string2)
print(tokenized_string2)
for ts in tokenized_string2:
    print('{} ----> {} ----> {}'.format(ts, tokenizer_en.decode([ts]), len(tokenizer_en.decode([ts]))))
Output -
13 ----> is ----> 3
13 ----> is ----> 3
13 ----> is ----> 3
13 ----> is ----> 3
13 ----> is ----> 3
156 ----> is ----> 2
As per the documentation of the arguments, it states:

vocab_list: list<str>, list of subwords for the vocabulary. Note that an underscore at the end of a subword indicates the end of the word (i.e. a space will be inserted afterwards when decoding). Underscores in the interior of subwords are disallowed and should use the underscore escape sequence.
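So id 13 decodes to 'is ' (the vocabulary entry 'is_', i.e. 'is' followed by a word boundary) and is used when a space follows, while id 156 decodes to 'is' with no trailing space and is used for the final 'is', where nothing follows. The trailing space is easy to miss in the printed output; repr() makes it visible. A small sketch, assuming the same tokenizer_en as above:

# Show the decoded subwords with their trailing space (if any) made explicit.
print(repr(tokenizer_en.decode([13])))    # 'is '  -> subword 'is_', ends a word, space follows
print(repr(tokenizer_en.decode([156])))   # 'is'   -> subword 'is', no word boundary follows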
Upvotes: 1