Reputation: 150
I am getting into NLP and neural machine translation. I understand how SentencePiece etc. can split a word into subwords, and map subwords to token IDs. But these token IDs are just integers representing the subword tokens. How do these IDs actually get used with NLP models?
Upvotes: 1
Views: 936
Reputation: 1231
The token IDs are indices into a vocabulary, in your case a sub-word vocabulary.
The IDs themselves are not used directly during the training of a network; instead, each ID is mapped to a vector.
Say you are inputting three words whose IDs are 12, 14, and 4. What is actually given as input is three vectors (say each of dimension n), where each ID is mapped to a unique vector. These vectors could be one-hot, i.e. a 1 at index 4 for token ID 4 and zeros everywhere else, or they could be pre-trained embeddings like GloVe.
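Here is a minimal NumPy sketch of both options, using the IDs 12, 14, and 4 from above (the vocabulary size of 20 and embedding dimension of 5 are arbitrary choices for illustration; in a real model the embedding matrix would be learned or pre-trained rather than random):

```python
import numpy as np

# Hypothetical setup: a vocabulary of 20 sub-word tokens, 5-dimensional embeddings.
vocab_size, embed_dim = 20, 5

# Token IDs produced by the tokenizer (the example IDs from the answer).
token_ids = [12, 14, 4]

# Option 1: one-hot vectors -- a 1 at the index equal to the token ID, zeros elsewhere.
one_hot = np.eye(vocab_size)[token_ids]          # shape (3, 20)

# Option 2: an embedding matrix; each row is the dense vector for one token ID.
# (Random here for illustration; it could instead hold pre-trained GloVe vectors.)
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))

# The "lookup" is just row indexing -- this is what an embedding layer does.
dense_vectors = embedding_matrix[token_ids]      # shape (3, 5)

print(one_hot.shape, dense_vectors.shape)
```

In frameworks like PyTorch this row lookup is what an embedding layer (e.g. `torch.nn.Embedding`) performs, and the matrix is updated during training like any other weight.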
Upvotes: 2