Gamuza

Reputation: 93

Apply embedding layer for categorical variable with keras

I have a dataset with many categorical features and many numerical features. I want to apply an embedding layer to convert the categorical data to numerical data for use with other models. But I got an error during training. My training process is:

  1. Perform label encoding on the categorical features
  2. Split into training and testing data with the train_test_split() function
  3. Drop the numerical columns; send only the categorical features and the target y for model training (a minimal sketch of this setup follows the list)
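
A minimal sketch of that setup; the DataFrame df and the column names "pet", "color", and "target" are hypothetical, just for illustration:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

cat_cols = ["pet", "color"]   # hypothetical categorical columns of df
for col in cat_cols:
    df[col] = LabelEncoder().fit_transform(df[col])   # step 1: label encode

X_train, X_test, y_train, y_test = train_test_split(  # step 2: split
    df[cat_cols], df["target"], test_size=0.2)        # step 3: categorical features only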

And I got this error:

    indices[13,0] = 10 is not in [0, 10)
     [[node functional_1/embed_6/embedding_lookup (defined at <ipython-input-34-0b6b3ae455d0>:4) ]] [Op:__inference_train_function_3509]

Errors may have originated from an input operation.
Input Source operations connected to node functional_1/embed_6/embedding_lookup:
 functional_1/embed_6/embedding_lookup/2395 (defined at /usr/lib/python3.6/contextlib.py:81)

Function call stack:
train_function

After searching, I found suggestions that the problem is a wrong vocabulary_size parameter on the embedding layer, and that enlarging vocabulary_size can solve it. But in my case, I need to map the result back to the original labels.

For example, I have a categorical feature ['dog', 'cat', 'fish']. After label encoding, it becomes [0, 1, 2]. An embedding layer for this feature with 3 unique values should output something like ([-0.22748041], [-0.03832678], [-0.16490786]). Then I can replace 'dog' in the original data with -0.22748041, 'cat' with -0.03832678, and so on. So I can't change the vocabulary_size, or the output dimension will be wrong.
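
A minimal sketch of that mapping, assuming a trained layer emb = keras.layers.Embedding(input_dim=3, output_dim=1) (the numbers above are illustrative):

# map each label-encoded category back to its learned 1-d embedding value
weights = emb.get_weights()[0]        # shape (3, 1): one row per category code
mapping = {label: weights[code][0]    # e.g. {'dog': -0.22748041, 'cat': -0.03832678, ...}
           for code, label in enumerate(['dog', 'cat', 'fish'])}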

I guess the problem in my case is that not all of the categorical values go into the training process. (E.g., only ['dog', 'fish'] are in the training data, and ['cat'] appears only in the testing data.) If I set the vocabulary_size to 3, it reports the error above. If I experimentally add ['cat'] to the training data, it works fine.

My question is: does the embedding layer have to see all of the unique values during training to perform the application I want? If there are many categorical features with many unique values, how do I ensure all of the unique values appear in the training data when splitting?

Thanks in advance!

Upvotes: 1

Views: 2730

Answers (1)

Anurag Dhadse

Reputation: 1873

Solution

You need to use out-of-vocabulary (OOV) buckets when creating the lookup table. OOV buckets allow lookups of unknown categories encountered during testing.

What does the solution do?

Setting the number of OOV buckets to a sufficiently large value (like 1000) lets you get IDs for categories that were not present in the training vocabulary.

import tensorflow as tf

words = tf.constant(vocabulary)                       # vocabulary: list of known categories
word_ids = tf.range(len(vocabulary), dtype=tf.int64)  # one integer id per category

# important
vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)
num_oov_buckets = 1000
table = tf.lookup.StaticVocabularyTable(vocab_init, num_oov_buckets)  # lookup table for category -> id
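
For instance, if vocabulary were ['dog', 'fish'] (so 'cat' is unseen, as in the question), a quick check of the table:

print(table.lookup(tf.constant(["dog", "fish", "cat"])))
# 'dog' and 'fish' map to their vocabulary ids (0 and 1);
# 'cat' is hashed into one of the num_oov_buckets extra ids (2 .. 1001)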

Then you can encode the training set (I am using the TensorFlow Datasets IMDB reviews dataset):

def encode_words(X_batch, y_batch):
    """
    Encode the training set, converting words to IDs
    using the lookup table just created.
    """
    return table.lookup(X_batch), y_batch

train_set = datasets["train"].batch(32).map(preprocess)  # preprocess (not shown) tokenizes the raw text
train_set = train_set.map(encode_words).prefetch(1)

When creating the model:

from tensorflow import keras

vocab_size = 10000      # set to the length of your vocabulary
embedding_size = 128    # tweakable hyperparameter
model = keras.models.Sequential([
    keras.layers.Embedding(vocab_size + num_oov_buckets, embedding_size,
                           input_shape=[None]),
    # usual code follows
])

And fit the model:

model.compile(loss="binary_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])
history = model.fit(train_set, epochs=5)
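
If, as in the question, you want the per-category vectors back after training, you can read the embedding matrix out of the first layer. A minimal sketch, assuming the model, table, and vocabulary defined above:

embedding_matrix = model.layers[0].get_weights()[0]  # shape: (vocab_size + num_oov_buckets, embedding_size)

ids = table.lookup(tf.constant(vocabulary)).numpy()  # id of each known category
category_vectors = {cat: embedding_matrix[i] for cat, i in zip(vocabulary, ids)}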

Upvotes: 1
