tf.keras how to handle a categorical column which has a list of values of variable length

Question

I am using tf.keras to build my model. Normally I use the tf.keras.layers.Embedding layer to handle categorical data. For example, suppose one of the input columns is in the following format:

App

fb
whatsapp
instagram

With the above data, I label-encode the column and pass it through the Embedding layer as below.

import tensorflow as tf
# vocab_size, embedding_size and feature are assumed to be defined earlier
inp = tf.keras.Input(shape=(1,), name="app_input")
emb_layer = tf.keras.layers.Embedding(vocab_size, embedding_size, input_length=1, name="emb_" + feature)(inp)

But what if a column has multiple values in each row? For example, the data is in the following format:

Apps

[fb, whatsapp]
[whatsapp, instagram, fb]

I can't use one-hot encoding because the number of unique apps is huge. I want to generate embeddings for these apps, but I am not sure how to handle the above data.

ags29 · Accepted Answer

One approach (and one that is commonly used) is to choose some fixed upper bound on the length of your input sequences and then pad the sequences that are shorter than this maximum with an additional "null" element, using e.g. tf.keras.preprocessing.sequence.pad_sequences.

Then you will use the padded sequences as inputs to an embedding layer emb_layer = tf.keras.layers.Embedding(vocab_size, embedding_size, input_length=max_len), where max_len is the upper bound referred to above.
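A minimal sketch of the padding-plus-embedding approach described above. The integer codes, max_len, vocab_size and embedding_size are illustrative assumptions, not values from the question. Setting mask_zero=True and averaging with GlobalAveragePooling1D is one possible way to reduce each row to a single fixed-size vector; it is an addition to the answer, not something the answer prescribes.

```python
import tensorflow as tf

# Hypothetical label-encoded rows. Index 0 is reserved for padding,
# so real apps start at 1 (fb=1, whatsapp=2, instagram=3).
rows = [[1, 2], [2, 3, 1]]

max_len = 4          # assumed upper bound on the number of apps per row
vocab_size = 4       # 3 apps + 1 padding index
embedding_size = 8   # illustrative embedding width

# Pad (or truncate) every row to max_len, appending zeros at the end.
padded = tf.keras.preprocessing.sequence.pad_sequences(
    rows, maxlen=max_len, padding="post", truncating="post", value=0
)
# padded is now [[1, 2, 0, 0], [2, 3, 1, 0]]

inp = tf.keras.Input(shape=(max_len,), name="apps_input")
# mask_zero=True marks index 0 as padding for mask-aware layers downstream.
emb = tf.keras.layers.Embedding(vocab_size, embedding_size, mask_zero=True)(inp)
# Average the per-app vectors into one vector per row; padded slots are skipped.
pooled = tf.keras.layers.GlobalAveragePooling1D()(emb)
model = tf.keras.Model(inp, pooled)

out = model.predict(padded, verbose=0)  # shape: (2, embedding_size)
```

The pooled output can then be concatenated with the other feature embeddings. If the order of apps in a row matters, a recurrent or attention layer could replace the averaging step.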
