Reputation: 20445
I am working with a Keras embedding layer and the Keras Tokenizer. At first I wasn't using oov_token (for unknown tokens), and the length of my tokenizer's word_counts was 54. For the embedding I used len(my_tokenizer.word_counts) + 1 as input_dim. Later I needed to handle unknown tokens, so I changed my code to the following:
my_tokenizer = Tokenizer(oov_token="<UNK>")  # First it was Tokenizer()
my_tokenizer.fit_on_texts(my_tokens)
my_sequences = my_tokenizer.texts_to_sequences(my_tokens)
But after adding the unknown token (which got index 1, as in {'<UNK>': 1, ...}) and keeping len(my_tokenizer.word_counts) + 1 as input_dim, I got an index error like 55 is not in [0, 55). My vocabulary size (len(my_tokenizer.word_counts) + 1) is now 55 (one more than in the previous approach without the unknown token, which was 54).
Now if I add 2 to my vocabulary size (len(my_tokenizer.word_counts)), it works fine:

layers.Embedding(len(my_tokenizer.word_counts) + 2, ...)

But I don't understand why I need to add 2 (making it 56).

I would be very thankful for the help.
Upvotes: 1
Views: 1014
Reputation: 36584
You need to add +2: one for the OOV token, as you mentioned, and one for padding. The Keras Tokenizer never assigns index 0 to any word; it is reserved for padding, and with oov_token set, the OOV token takes index 1. So your 54 words get indices 2 through 55, and input_dim must be at least 56 (the largest index plus one).
Upvotes: 3
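A minimal sketch of the indexing, assuming TensorFlow's bundled Keras (the toy corpus below is a placeholder, not your data):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras import layers

# Toy corpus with 5 unique words (yours has 54).
my_tokens = ["the cat sat", "the dog ran"]

my_tokenizer = Tokenizer(oov_token="<UNK>")
my_tokenizer.fit_on_texts(my_tokens)

print(my_tokenizer.word_index)
# {'<UNK>': 1, 'the': 2, 'cat': 3, 'sat': 4, 'dog': 5, 'ran': 6}
# Index 0 is never assigned to a word: it is reserved for padding.
# The OOV token takes index 1, so the largest word index is
# len(my_tokenizer.word_counts) + 1.

vocab_size = len(my_tokenizer.word_counts) + 2  # +1 for padding, +1 for OOV
embedding = layers.Embedding(input_dim=vocab_size, output_dim=8)

In your case that is 54 + 2 = 56, which is why len(my_tokenizer.word_counts) + 2 works.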