Yu Gu

Reputation: 2513

What's the point to have a UNK token for out of vocabulary words during decoding?

First of all, I know this question is kind of off-topic, but I have already tried to ask elsewhere but got no response.

Adding a UNK token to the vocabulary is a conventional way to handle out-of-vocabulary (OOV) words in NLP tasks. It is totally understandable to have it for encoding, but what's the point of having it for decoding? I mean, you would never expect your decoder to generate a UNK token during prediction, right?

Upvotes: 2

Views: 2732

Answers (2)

Jindřich

Reputation: 11213

Depending on how you preprocess your training data, you might need the UNK token during training. Even if you use BPE or another subword segmentation, OOV tokens can still appear in the training data: usually some weird UTF-8 artifacts, fragments of alphabets you are not interested in at all, etc.

For example, if you take the WMT training data for English-German translation, do BPE, and build the vocabulary, your vocabulary will contain thousands of Chinese characters that occur exactly once in the training data. Even if you keep them in the vocabulary, the model has no chance to learn anything about them, not even to copy them. It makes sense to represent them as UNKs.
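A minimal sketch of this preprocessing step (not the actual WMT pipeline, and with made-up helper names like build_vocab and unk_replace): drop tokens below a frequency threshold from the vocabulary and map them to UNK before training.

```python
from collections import Counter

UNK = "<unk>"

def build_vocab(corpus, min_count=2):
    """corpus: iterable of token lists; keep tokens seen at least min_count times."""
    counts = Counter(tok for sent in corpus for tok in sent)
    return {tok for tok, c in counts.items() if c >= min_count} | {UNK}

def unk_replace(sentence, vocab):
    """Map every out-of-vocabulary token to UNK before training."""
    return [tok if tok in vocab else UNK for tok in sentence]

corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "龍"]]
vocab = build_vocab(corpus, min_count=2)         # keeps {'the', 'cat', '<unk>'}
print(unk_replace(["the", "龍", "sat"], vocab))  # ['the', '<unk>', '<unk>']
```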

Of course, what you usually do at inference time is prevent the model from predicting UNK tokens, since UNK is always incorrect.
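One common way to do that, sketched below under the assumption that the decoder exposes a logits vector over the output vocabulary at each step (UNK_ID and pick_next_token are hypothetical names): mask out the UNK logit so the search can never select it.

```python
import numpy as np

UNK_ID = 0  # hypothetical index of <unk> in the output vocabulary

def pick_next_token(logits: np.ndarray) -> int:
    """Greedy choice of the next token with UNK forbidden."""
    logits = logits.copy()
    logits[UNK_ID] = -np.inf   # UNK can never win the argmax
    return int(np.argmax(logits))

step_logits = np.array([5.0, 1.2, 0.3, 4.9])  # UNK happens to score highest
print(pick_next_token(step_logits))           # 3, the best non-UNK token
```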

Upvotes: 3

Bruno Mello

Reputation: 4618

I have used it one time in the following situation:

I had pretrained word embeddings (glove.6b.50d.txt) and my model was outputting an embedding vector. To turn it into a word, I used cosine similarity against all vectors in the embedding table, and if the most similar vector was the UNK vector, I would output UNK.
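A small sketch of that lookup, assuming the GloVe vectors plus an extra UNK vector are already loaded into `words` (a list of strings) and `vectors` (a matrix with one row per word); the names `nearest_word` and `load_glove` are made up for illustration.

```python
import numpy as np

def nearest_word(predicted_vec, words, vectors):
    """Return the word whose embedding is most cosine-similar to the
    vector the decoder produced; this may well be '<unk>'."""
    sims = vectors @ predicted_vec / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(predicted_vec) + 1e-9
    )
    return words[int(np.argmax(sims))]

# words, vectors = load_glove("glove.6b.50d.txt")  # loading left out here
```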

Maybe I'm just guessing here, but what I think might happen under the hood is that the model predicts based on previous words (e.g. it predicts the word that appeared 3 iterations ago), and if that word is UNK, the network outputs UNK.

Upvotes: 0
