Reputation: 93
I have a dataset with many features, many of them categorical. I want to apply an embedding layer to convert the categorical data into numerical data for use by other models, but I get an error during training. My training process is:
And I got this error:
indices[13,0] = 10 is not in [0, 10)
[[node functional_1/embed_6/embedding_lookup (defined at <ipython-input-34-0b6b3ae455d0>:4) ]] [Op:__inference_train_function_3509]
Errors may have originated from an input operation.
Input Source operations connected to node functional_1/embed_6/embedding_lookup:
functional_1/embed_6/embedding_lookup/2395 (defined at /usr/lib/python3.6/contextlib.py:81)
Function call stack:
train_function
After searching, I found suggestions that the problem is a wrong vocabulary_size parameter on the embedding layer, and that enlarging the vocabulary_size can solve it. But in my case, I need to map the result back to the original labels.
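For reference, the error is easy to reproduce: an embedding layer with input_dim=10 only accepts indices in [0, 10), so any larger index fails the lookup. A minimal sketch (illustrative, not my actual code):
import numpy as np
from tensorflow import keras

embed = keras.layers.Embedding(input_dim=10, output_dim=4)
embed(np.array([[9]]))   # OK: 9 is in [0, 10)
embed(np.array([[10]]))  # InvalidArgumentError (on CPU): indices[0,0] = 10 is not in [0, 10)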
For example, I have a categorical feature ['dog', 'cat', 'fish']. After label encoding, it becomes [0, 1, 2]. An embedding layer for this feature with 3 unique values should output something like ([-0.22748041], [-0.03832678], [-0.16490786]). Then I can replace the 'dog' values in the original data with -0.22748041, replace 'cat' with -0.03832678, and so on.
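For concreteness, a minimal sketch of this mapping, assuming scikit-learn's LabelEncoder for the encoding step (which assigns ids alphabetically, so the exact order may differ) and an untrained embedding:
import tensorflow as tf
from tensorflow import keras
from sklearn.preprocessing import LabelEncoder

cats = ["dog", "cat", "fish"]
le = LabelEncoder()
ids = le.fit_transform(cats)                       # alphabetical: cat=0, dog=1, fish=2
embed = keras.layers.Embedding(input_dim=3, output_dim=1)
values = embed(tf.constant(ids)).numpy().squeeze(-1)
mapping = dict(zip(cats, values))                  # after training, these are the replacement values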
So I can't change the vocabulary_size, or the output dimension will be wrong.
I guess the problem in my case is that not all of the categorical values go through the training process (e.g. only 'dog' and 'fish' are in the training data; 'cat' appears only in the testing data). If I set the vocabulary_size to 3, it reports an error like the one above. If I experimentally add 'cat' to the training data, it works fine.
My question is: does the embedding layer have to see every unique value during training to perform the mapping I want? If there are many categorical features with many unique values, how can I ensure that every unique value appears in the training data when splitting the data?
Thanks in advance!
Upvotes: 1
Views: 2730
Reputation: 1873
You need to use out-of-vocabulary (oov) buckets when creating the lookup table. oov buckets allow unknown categories encountered during testing to be looked up.
Setting num_oov_buckets to a sufficiently large number (like 1000) lets you get ids for those other categories as well, the ones that were not present in the training data.
import tensorflow as tf

words = tf.constant(vocabulary)  # vocabulary: list of known categories
word_ids = tf.range(len(vocabulary), dtype=tf.int64)
# important
vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)
num_oov_buckets = 1000
table = tf.lookup.StaticVocabularyTable(vocab_init, num_oov_buckets)  # lookup table for category -> id
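A quick sanity check with the question's toy vocabulary ['dog', 'fish'] (hypothetical output): known categories map to their ids, while an unseen one hashes into an oov bucket instead of raising an out-of-range error:
print(table.lookup(tf.constant(["dog", "fish", "cat"])))
# e.g. tf.Tensor([  0   1 842], shape=(3,), dtype=int64)  # "cat" -> a stable oov id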
Then you can encode the training set (I am using the IMDB reviews dataset from TensorFlow Datasets).
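The datasets object and the preprocess function used below are not shown in the answer; a minimal sketch, assuming raw IMDB review strings loaded via tensorflow_datasets, might look like this:
import tensorflow as tf
import tensorflow_datasets as tfds

datasets = tfds.load("imdb_reviews", as_supervised=True)  # dict with "train" and "test" splits

def preprocess(X_batch, y_batch):
    """Truncate each review, strip <br /> tags and punctuation, split into words."""
    X_batch = tf.strings.substr(X_batch, 0, 300)
    X_batch = tf.strings.regex_replace(X_batch, rb"<br\s*/?>", b" ")
    X_batch = tf.strings.regex_replace(X_batch, b"[^a-zA-Z']", b" ")
    X_batch = tf.strings.split(X_batch)
    return X_batch.to_tensor(default_value=b"<pad>"), y_batch

With that in place, the encoding step: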
def encode_words(X_batch, y_batch):
    """
    Encode the training set, converting words to IDs
    using the lookup table just created.
    """
    return table.lookup(X_batch), y_batch

train_set = datasets["train"].batch(32).map(preprocess)
train_set = train_set.map(encode_words).prefetch(1)
When creating the model:
from tensorflow import keras

vocab_size = 10000  # whatever the length of your vocabulary is
embedding_size = 128  # tweakable | hyperparameter

model = keras.models.Sequential([
    keras.layers.Embedding(vocab_size + num_oov_buckets, embedding_size,
                           input_shape=[None]),
    # usual code follows
])
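The omitted layers depend on your task; for the IMDB binary-sentiment example, one plausible completion (an assumption, not the exact code from the answer) is a recurrent stack ending in a sigmoid unit:
model = keras.models.Sequential([
    keras.layers.Embedding(vocab_size + num_oov_buckets, embedding_size,
                           input_shape=[None]),
    keras.layers.GRU(128, return_sequences=True),  # sequence in, sequence out
    keras.layers.GRU(128),                         # sequence in, vector out
    keras.layers.Dense(1, activation="sigmoid")    # binary sentiment output
])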
and fit the model:
model.compile(loss="binary_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])
history = model.fit(train_set, epochs=5)
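At test time the same pipeline applies; words unseen during training hash into the oov buckets, so the embedding layer never receives an out-of-range index (a sketch, reusing the assumed preprocess above):
test_set = datasets["test"].batch(32).map(preprocess).map(encode_words)
model.evaluate(test_set)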
Upvotes: 1