Reputation: 1549
I'm currently working through a TensorFlow Transformer tutorial for sequence-to-sequence translation. At the beginning of the tutorial, the class tfds.features.text.SubwordTextEncoder is used. This class can be used to convert a string to a list of integers, each representing a subword.
After using the class SubwordTextEncoder to train an English tokenizer as follows:
tokenizer_en = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    (en.numpy() for pt, en in train_examples), target_vocab_size=2**13)
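For context, train_examples in that tutorial comes from the TED Talks Portuguese-to-English dataset loaded with tfds; a minimal setup sketch, assuming the same dataset name as in the tutorial:

import tensorflow_datasets as tfds

# Load the Portuguese-to-English translation dataset as (pt, en) pairs.
examples, metadata = tfds.load('ted_hrlr_translate/pt_to_en',
                               with_info=True, as_supervised=True)
train_examples, val_examples = examples['train'], examples['validation']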
the tutorial shows how this tokenizer can now be used to convert strings to lists of integers. This code snippet
sample_string = 'Transformer is awesome.'
tokenized_string = tokenizer_en.encode(sample_string)
print ('Tokenized string is {}'.format(tokenized_string))
gives the following result:
[7915, 1248, 7946, 7194, 13, 2799]
where the integer-to-subword mapping can be shown as follows:
for ts in tokenized_string:
    print('{} ----> {}'.format(ts, tokenizer_en.decode([ts])))
returns
7915 ----> T
1248 ----> ran
7946 ----> s
7194 ----> former
13 ----> is
2799 ----> awesome
This all makes sense to me. The tokenizer recognises the words 'is' and 'awesome' from its training set and assigns the corresponding integers. The word 'Transformer', which was not in its training set, is split up into parts, as mentioned in the documentation.
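The tutorial also shows that this splitting is lossless: decoding the full list of ids reconstructs the original string. A small check, assuming the same tokenizer_en, sample_string and tokenized_string as above:

# Decoding the complete id list recovers the original text.
original_string = tokenizer_en.decode(tokenized_string)
print('The original string: {}'.format(original_string))

assert original_string == sample_string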
After some experimenting with the tokenizer however, I got confused. Please consider the following code snippets
sample_string2 = 'the best there is'
tokenized_string2 = tokenizer_en.encode(sample_string2)
print(tokenized_string2)
which returns
[3, 332, 64, 156]
and
for ts in tokenized_string2:
    print('{} ----> {}'.format(ts, tokenizer_en.decode([ts])))
which returns
3 ----> the
332 ----> best
64 ----> there
156 ----> is
Question: Why does the tokenizer return different integers for the same word if it appears in a different part of the sentence? The word 'is' maps to 156 in the second example, whereas in the first example it is mapped to 13, using the same tokenizer.
Upvotes: 2
Views: 1849
Reputation:
I added len(tokenizer_en.decode([ts])) to the print statement to show the length of each decoded subword, and tried the example below.
Example:
sample_string2 = 'is is is is is is'
tokenized_string2 = tokenizer_en.encode(sample_string2)
print(tokenized_string2)
for ts in tokenized_string2:
    print('{} ----> {} ----> {}'.format(ts, tokenizer_en.decode([ts]), len(tokenizer_en.decode([ts]))))
Output -
13 ----> is ----> 3
13 ----> is ----> 3
13 ----> is ----> 3
13 ----> is ----> 3
13 ----> is ----> 3
156 ----> is ----> 2
As per the documentation of the arguments, it states:

vocab_list: list<str>, list of subwords for the vocabulary. Note that an underscore at the end of a subword indicates the end of the word (i.e. a space will be inserted afterwards when decoding). Underscores in the interior of subwords are disallowed and should use the underscore escape sequence.
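So id 13 decodes to 'is ' (the vocabulary entry 'is_', i.e. 'is' followed by a word boundary) and is used when a space follows, while id 156 decodes to 'is' with no trailing space and is used for the final 'is', where nothing follows. The trailing space is easy to miss in the printed output; repr() makes it visible. A small sketch, assuming the same tokenizer_en as above:

# Show the decoded subwords with their trailing space (if any) made explicit.
print(repr(tokenizer_en.decode([13])))    # 'is '  -> subword 'is_', ends a word, space follows
print(repr(tokenizer_en.decode([156])))   # 'is'   -> subword 'is', no word boundary follows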
Upvotes: 1