Reputation: 4470
I want to get the numpy vectors created using the "Embedding Column" in TensorFlow.
For example, creating a sample DF:
import pandas as pd

sample_column1 = ["Apple","Apple","Mango","Apple","Banana","Mango","Mango","Banana","Banana"]
sample_column2 = [1,2,1,3,4,6,2,1,3]
ds = pd.DataFrame(sample_column1, columns=["A"])
ds["B"] = sample_column2
ds
Converting the pandas DF to a TensorFlow dataset:
import tensorflow as tf

# A utility method to create a tf.data dataset from a pandas DataFrame
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    labels = dataframe.pop('B')
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    return ds
Creating an embedding column:
from tensorflow import feature_column

tf_ds = df_to_dataset(ds)

# embedding cols
col_a = feature_column.categorical_column_with_vocabulary_list(
    'A', ['Apple', 'Mango', 'Banana'])
col_a_embedding = feature_column.embedding_column(col_a, dimension=8)
Is there any way to get the embeddings as numpy vectors from the 'col_a_embedding' object?
For example, the category "Apple" will be embedded into a vector of size 8:
[a1 a2 a3 a4 a5 a6 a7 a8]
Can we fetch that vector?
Upvotes: 1
Views: 2445
Reputation: 1564
I suggest the easiest and fastest way is the following, which is what I am doing in my own app:
Use pandas read_csv to load your file, giving the string column the dtype "category" via the dtype parameter. Let's call that field "f". This is the original string column, not a numerical column yet.
Still in pandas, create a new column and copy the original column's cat.codes into it. Let's call that field "f_code". Pandas automatically encodes it as a compactly represented numerical column; it will have the numbers you need for passing to a neural network.
Now, in your Keras functional-API model, pass f_code to the model's Input layer. Its values are plain integers now (e.g. int8), so an Embedding layer can process them correctly. Don't pass the original string column to the model.
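Condensed, the three steps look roughly like this (a sketch; the file name data.csv, the field names, and the dimensions are placeholders, not from my project):
import pandas as pd
import tensorflow as tf

# Step 1: read the string column as a pandas "category" dtype.
d = pd.read_csv("data.csv", dtype={"f": "category"})

# Step 2: copy the category codes into a new numeric column.
d["f_code"] = d["f"].cat.codes

# Step 3: feed the code column into an Embedding layer via an Input layer.
num_uniques = d["f"].nunique()
inp = tf.keras.layers.Input(shape=(1,), name="f_code")
emb = tf.keras.layers.Embedding(input_dim=num_uniques + 1, output_dim=8)(inp)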
Below are some sample code lines copied out of my project doing exactly the steps above.
all_col_types_readcsv = {'userid':'int32','itemid':'int32','rating':'float32','user_age':'int32','gender':'category','job':'category','zipcode':'category'}
<some code omitted>
d = pd.read_csv(fn, sep='|', header=0, dtype=all_col_types_readcsv, encoding='utf-8', usecols=usecols_readcsv)
<some code omitted>
from pandas.api.types import is_string_dtype
# Select the columns to add code columns to. Numeric cols work fine with Embedding layer so ignore them.
cat_cols = [cn for cn in d.select_dtypes('category')]
print(cat_cols)
str_cols = [cn for cn in d.columns if is_string_dtype(d[cn])]
print(str_cols)
add_code_columns = [cn for cn in d.columns if (cn in cat_cols) and (cn in str_cols)]
print(add_code_columns)
<some code omitted>
# Actually add a _code column for each of the selected columns
for cn in add_code_columns:
    codecolname = cn + "_code"
    if codecolname not in d.columns:
        d[codecolname] = d[cn].cat.codes
You can see the numeric codes pandas made for you:
d.info()
d.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99991 entries, 0 to 99990
Data columns (total 5 columns):
userid 99991 non-null int32
itemid 99991 non-null int32
rating 99991 non-null float32
job 99991 non-null category
job_code 99991 non-null int8
dtypes: category(1), float32(1), int32(2), int8(1)
memory usage: 1.3 MB
Finally, in this example you can omit the job column and keep the job_code column for passing into your Keras neural network model. Here is some of my model code:
# Slice this field's column out of the combined input tensor
v = Lambda(lambda z: z[:, field_num0_X_cols[cn]], output_shape=(), name="Parser_" + cn)(input_x)
# Add a trailing axis so the Embedding layer sees shape (batch, 1)
emb_input = Lambda(lambda z: tf.expand_dims(z, axis=-1), output_shape=(1,), name="Expander_" + cn)(v)
a = Embedding(input_dim=num_uniques[cn]+1, output_dim=emb_len[cn], input_length=1, embeddings_regularizer=reg, name="E_" + cn)(emb_input)
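Incidentally, this also gives you the numpy vectors the original question asks about: after training, the learned embedding matrix can be read back out of the named layer, for example:
# Fetch the learned embedding matrix as a numpy array; row i is the
# vector for code i. Shape: [num_uniques[cn] + 1, emb_len[cn]].
emb_matrix = model.get_layer("E_" + cn).get_weights()[0]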
By the way, please also wrap np.array() around all pandas dataframes when passing them into model.fit(). It's not well documented, and apparently also not checked at runtime, that pandas dataframes cannot be safely passed in; otherwise you get massive memory allocations which crash hosts.
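For example (X_df, y_df, and the fit parameters are placeholders for whatever you pass):
# Convert to plain numpy arrays before fitting to avoid the memory blow-up.
model.fit(np.array(X_df), np.array(y_df), batch_size=256, epochs=10)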
Upvotes: 1
Reputation: 16104
I don't see a way to get what you want using feature columns (I don't see a function named sequence_embedding_column or similar among the available functions in tf.feature_column). The result from feature columns seems to be a fixed-length tensor: they achieve that by using a combiner to aggregate individual embedding vectors (sum, mean, sqrtn, etc.), so the dimension along the sequence of categories is actually lost.
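You can see this in the embedding_column signature itself; for example, reusing the question's column ('mean' is the default combiner, with 'sum' and 'sqrtn' as alternatives):
# However many category values a row produces, the combiner reduces their
# embedding vectors to a single fixed-length (here 8-dim) vector.
col_a = tf.feature_column.categorical_column_with_vocabulary_list(
    'A', ['Apple', 'Mango', 'Banana'])
col_a_embedding = tf.feature_column.embedding_column(
    col_a, dimension=8, combiner='mean')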
But it's totally doable if you use lower-level apis. First you could construct a lookup table to convert categorical strings to ids.
features = tf.constant(["apple", "banana", "apple", "mango"])
table = tf.lookup.index_table_from_file(
    vocabulary_file="fruit.txt", num_oov_buckets=1)
ids = table.lookup(features)

# Content of "fruit.txt"
apple
mango
banana
unknown
Now you could initialize the embedding as a 2D variable. Its shape is [number of categories, embedding dimension].
num_categories = 5  # 4 entries in fruit.txt + 1 OOV bucket, so ids run from 0 to 4
embedding_dim = 64
category_emb = tf.get_variable(
    "embedding_table", [num_categories, embedding_dim],
    initializer=tf.truncated_normal_initializer(stddev=0.02))
You could then look up the category embeddings like below:
ids_embeddings = tf.nn.embedding_lookup(category_emb, ids)
Note the result in ids_embeddings is one long concatenated tensor. Feel free to reshape it to the shape you want.
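To actually materialize those embeddings as numpy arrays in TF 1.x graph mode, evaluate them in a session (remember to initialize the lookup table as well):
# Evaluate the lookup + embedding graph; `vectors` is a numpy array of
# shape [4, 64], one row per input string in `features`.
with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    vectors = sess.run(ids_embeddings)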
Upvotes: 1