Gamuza

Reputation: 93

Apply embedding layer for categorical variable with keras

I have a dataset with many categorical features and many numerical features. I want to apply an embedding layer to convert the categorical data to numerical data for use with other models. But I got an error during training. My training process is:

  1. Perform label encoding on the categorical features
  2. Split into training and testing data with the train_test_split() function
  3. Drop the numerical columns; send only the categorical features and the target y for model training (a minimal sketch of this setup follows the list)
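
A minimal sketch of that setup; the DataFrame df and the column names "pet", "color", and "target" are hypothetical, just for illustration:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

cat_cols = ["pet", "color"]   # hypothetical categorical columns of df
for col in cat_cols:
    df[col] = LabelEncoder().fit_transform(df[col])   # step 1: label encode

X_train, X_test, y_train, y_test = train_test_split(  # step 2: split
    df[cat_cols], df["target"], test_size=0.2)        # step 3: categorical features only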

And I got this error:

    indices[13,0] = 10 is not in [0, 10)
     [[node functional_1/embed_6/embedding_lookup (defined at <ipython-input-34-0b6b3ae455d0>:4) ]] [Op:__inference_train_function_3509]

Errors may have originated from an input operation.
Input Source operations connected to node functional_1/embed_6/embedding_lookup:
 functional_1/embed_6/embedding_lookup/2395 (defined at /usr/lib/python3.6/contextlib.py:81)

Function call stack:
train_function

After searching, I found suggestions that the problem is a wrong vocabulary_size parameter on the embedding layer, and that enlarging vocabulary_size can solve it. But in my case, I need to map the result back to the original labels.

For example, I have a categorical feature ['dog', 'cat', 'fish']. After label encoding, it becomes [0, 1, 2]. An embedding layer for this feature with 3 unique values should output something like ([-0.22748041], [-0.03832678], [-0.16490786]). Then I can replace 'dog' in the original data with -0.22748041, 'cat' with -0.03832678, and so on. So I can't change the vocabulary_size, or the output dimension will be wrong.
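
A minimal sketch of that mapping, assuming a trained layer emb = keras.layers.Embedding(input_dim=3, output_dim=1) (the numbers above are illustrative):

# map each label-encoded category back to its learned 1-d embedding value
weights = emb.get_weights()[0]        # shape (3, 1): one row per category code
mapping = {label: weights[code][0]    # e.g. {'dog': -0.22748041, 'cat': -0.03832678, ...}
           for code, label in enumerate(['dog', 'cat', 'fish'])}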

I guess the problem in my case is that not all of the categorical values go into the training process. (E.g., only ['dog', 'fish'] are in the training data, and ['cat'] appears only in the testing data.) If I set the vocabulary_size to 3, it reports the error above. If I experimentally add ['cat'] to the training data, it works fine.

My question is: does the embedding layer have to see all of the unique values during training to perform the application I want? If there are many categorical features with many unique values, how do I ensure all of the unique values appear in the training data when splitting?

Thanks in advance!

Upvotes: 1

Views: 2730

Answers (1)

Anurag Dhadse

Reputation: 1873

Solution

You need to use out-of-vocabulary (OOV) buckets when creating the lookup table. OOV buckets allow lookups of unknown categories encountered during testing.

What does the solution do?

Setting the number of OOV buckets to a sufficiently large value (like 1000) lets you get IDs for categories that were not present in the training vocabulary.

import tensorflow as tf

words = tf.constant(vocabulary)                       # vocabulary: list of known categories
word_ids = tf.range(len(vocabulary), dtype=tf.int64)  # one integer id per category

# important
vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)
num_oov_buckets = 1000
table = tf.lookup.StaticVocabularyTable(vocab_init, num_oov_buckets)  # lookup table for category -> id
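
For instance, if vocabulary were ['dog', 'fish'] (so 'cat' is unseen, as in the question), a quick check of the table:

print(table.lookup(tf.constant(["dog", "fish", "cat"])))
# 'dog' and 'fish' map to their vocabulary ids (0 and 1);
# 'cat' is hashed into one of the num_oov_buckets extra ids (2 .. 1001)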

Then you can encode the training set (I am using the TensorFlow Datasets IMDB reviews dataset):

def encode_words(X_batch, y_batch):
    """
    Encode the training set, converting words to IDs
    using the lookup table just created.
    """
    return table.lookup(X_batch), y_batch

train_set = datasets["train"].batch(32).map(preprocess)  # preprocess (not shown) tokenizes the raw text
train_set = train_set.map(encode_words).prefetch(1)

When creating the model:

from tensorflow import keras

vocab_size = 10000      # set to the length of your vocabulary
embedding_size = 128    # tweakable hyperparameter
model = keras.models.Sequential([
    keras.layers.Embedding(vocab_size + num_oov_buckets, embedding_size,
                           input_shape=[None]),
    # usual code follows
])

And fit the model:

model.compile(loss="binary_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])
history = model.fit(train_set, epochs=5)
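
If, as in the question, you want the per-category vectors back after training, you can read the embedding matrix out of the first layer. A minimal sketch, assuming the model, table, and vocabulary defined above:

embedding_matrix = model.layers[0].get_weights()[0]  # shape: (vocab_size + num_oov_buckets, embedding_size)

ids = table.lookup(tf.constant(vocabulary)).numpy()  # id of each known category
category_vectors = {cat: embedding_matrix[i] for cat, i in zip(vocabulary, ids)}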

Upvotes: 1
