Reputation: 7358
I am using tf.keras to build my model. Normally I use a tf.keras.layers.Embedding layer to handle categorical data. For example, suppose one of the input columns is in the following format:
App
fb
whatsapp
instagram
Given the above data, I label-encode it and pass it through the Embedding layer as shown below.
inp = tf.keras.Input(shape=(1,), name="app_input")
emb_layer = tf.keras.layers.Embedding(vocab_size, embedding_size, input_length=1, name="emb_" + feature)(inp)
But what if a column has multiple values in each row? For example, the data is in the following format:
Apps
[fb, whatsapp]
[whatsapp, instagram, fb]
I can't use one-hot encoding because the number of unique apps is huge. I want to generate embeddings for these apps, but I am not sure how to handle the above data.
Upvotes: 0
Views: 1132
Reputation: 2696
One commonly used approach is to choose a fixed upper bound on the length of your input sequences and then pad every sequence shorter than this maximum with an additional "null" element, e.g. using tf.keras.preprocessing.sequence.pad_sequences.
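As a rough illustration of the padding step, here is a dependency-free sketch that label-encodes the example rows from the question and pads them by hand. The `pad` helper, the choice of index 0 as the "null" element, and `max_len = 3` are assumptions made for this example, not part of the original answer.

```python
# Example rows taken from the question.
rows = [["fb", "whatsapp"], ["whatsapp", "instagram", "fb"]]

# Label-encode the apps, reserving index 0 for the padding ("null") element.
vocab = {}
for row in rows:
    for app in row:
        vocab.setdefault(app, len(vocab) + 1)

max_len = 3  # assumed upper bound on the number of apps per row


def pad(ids, max_len, pad_value=0):
    """Right-pad (or truncate) a list of ids to exactly max_len entries."""
    ids = ids[:max_len]
    return ids + [pad_value] * (max_len - len(ids))


encoded = [[vocab[app] for app in row] for row in rows]
padded = [pad(ids, max_len) for ids in encoded]
print(padded)
```

Note that pad_sequences pads at the front of each sequence by default; passing padding="post" should give the right-padded layout used in this sketch.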
Then you will use the padded sequences as inputs to an embedding layer, emb_layer = tf.keras.layers.Embedding(vocab_size, embedding_size, input_length=max_len), where max_len is the upper bound referred to above.
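A minimal sketch of passing padded sequences through an embedding layer might look like the following. The toy sizes, the layer names, and the use of `mask_zero=True` (which marks index 0 as padding for downstream layers) are assumptions for illustration. The sequence length is set through the `Input` shape here rather than through `input_length`.

```python
import numpy as np
import tensorflow as tf

# Assumed toy values: 3 distinct apps plus index 0 reserved for padding.
vocab_size = 4
embedding_size = 8
max_len = 3

# Padded, label-encoded rows, e.g. [fb, whatsapp] and [whatsapp, instagram, fb].
padded = np.array([[1, 2, 0], [2, 3, 1]], dtype="int32")

inp = tf.keras.Input(shape=(max_len,), dtype="int32", name="apps_input")
emb = tf.keras.layers.Embedding(
    vocab_size, embedding_size, mask_zero=True, name="emb_apps"
)(inp)
model = tf.keras.Model(inp, emb)

# One embedding vector per position in each padded row.
out = model(tf.constant(padded))
print(out.shape)
```

The output has one vector per sequence position, so its shape is (batch, max_len, embedding_size). How those per-position vectors are combined afterwards (for example by pooling or a recurrent layer) is a separate design choice.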
Upvotes: 1