A.B

Reputation: 20445

Why does the Keras Tokenizer with an unknown token require the Embedding's input_dim to be vocab_size + 2 instead of vocab_size + 1?

I am working with a Keras Embedding layer and the Keras Tokenizer. At first I wasn't using oov_token (for unknown tokens), and the length of my tokenizer's word_counts was 54.

For the Embedding layer I used to pass len(my_tokenizer.word_counts) + 1 as input_dim. Later I needed to handle unknown tokens, so I changed my code to the following:

my_tokenizer = Tokenizer(oov_token="<UNK>")  # First it was Tokenizer()
my_tokenizer.fit_on_texts(my_tokens)
my_sequences = my_tokenizer.texts_to_sequences(my_tokens)

But after adding the unknown token (which got index 1, as in {'<UNK>': 1, ...}) and still using len(my_tokenizer.word_counts) + 1 as input_dim, I got an index error like

55 not in indices [0,55]

My vocabulary size (len(my_tokenizer.word_counts) + 1) is 55, one more than in the earlier approach (54) without the unknown token.

Now if I add 2 to my vocabulary size (len(my_tokenizer.word_counts)) it works fine:

layers.Embedding(len(my_tokenizer.word_counts) + 2, ...)

but I don't understand why I need to add 2 (making it 56).

I would be very thankful for any help.

Upvotes: 1

Views: 1014

Answers (1)

Nicolas Gervais

Reputation: 36584

You need to add 2: one for the OOV token, as you mentioned, and one because index 0 is reserved for padding and is never assigned to a word.
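A quick way to see where the two extra rows come from is to inspect word_counts and word_index on a tiny corpus. The sketch below is only illustrative (it assumes tensorflow.keras and a made-up texts list): word_counts contains only the words actually seen in the texts, while word_index additionally contains the OOV token, and index 0 is kept free for padding.

    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras import layers

    texts = ["the cat sat", "the dog ran"]   # 5 distinct words

    tokenizer = Tokenizer(oov_token="<UNK>")
    tokenizer.fit_on_texts(texts)

    print(len(tokenizer.word_counts))  # 5 -> words seen in the texts only
    print(tokenizer.word_index)        # {'<UNK>': 1, 'the': 2, ...} -> 6 entries
    # Index 0 is never assigned to a word; it is reserved for padding.

    # The largest index texts_to_sequences can emit is len(word_index) == 6,
    # so input_dim must be at least 7 = len(word_counts) + 2
    #                                 = len(word_index) + 1
    embedding = layers.Embedding(
        input_dim=len(tokenizer.word_counts) + 2,  # padding (0) + OOV + 5 words
        output_dim=8,
    )

Equivalently, len(tokenizer.word_index) + 1 gives the same number and is usually the less error-prone expression, since word_index already includes the OOV token.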

Upvotes: 3
