data_person

Reputation: 4470

Get embedding vectors from Embedding Column in Tensorflow

I want to get the numpy vectors created using the "Embedding Column" in Tensorflow.

For example, creating a sample DF:

import pandas as pd

sample_column1 = ["Apple","Apple","Mango","Apple","Banana","Mango","Mango","Banana","Banana"]
sample_column2 = [1,2,1,3,4,6,2,1,3]
ds = pd.DataFrame(sample_column1, columns=["A"])
ds["B"] = sample_column2
ds

Converting the pandas DataFrame to a tf.data Dataset:

import tensorflow as tf

# A utility method to create a tf.data dataset from a pandas DataFrame
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    labels = dataframe.pop('B')
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    return ds

Creating an embedding column:

from tensorflow import feature_column

tf_ds = df_to_dataset(ds)
# embedding column
col_a = feature_column.categorical_column_with_vocabulary_list(
    'A', ['Apple', 'Mango', 'Banana'])
col_a_embedding = feature_column.embedding_column(col_a, dimension=8)

Is there any way to get the embeddings as numpy vectors from the col_a_embedding object?

For example, the category "Apple" will be embedded into a vector of size 8:

[a1 a2 a3 a4 a5 a6 a7 a8]

Can we fetch that vector?

Upvotes: 1

Views: 2445

Answers (2)

Geoffrey Anderson

Reputation: 1564

I suggest the easiest and fastest way is to do it like this, which is what I am doing in my own app:

  1. Use pandas to read_csv your file, reading the string column as dtype "category" via the dtype parameter. Let's call it field "f". This is the original string column, not a numerical column yet.

  2. Still in pandas, create a new column and copy the original column's cat.codes into it. Let's call it field "f_code". Pandas automatically encodes it as a compactly represented numerical column, which has the numbers you need for passing to neural networks.

  3. In your Keras functional-API model, pass f_code to the model's Input layer; its values are now small integers (e.g. int8), which the Embedding layer will process correctly. Don't pass the original string column to the model. (A minimal sketch of these three steps follows.)
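A minimal self-contained sketch of those three steps, with a made-up single-column frame (field names f and f_code as above):

import pandas as pd

# Step 1: read/convert the string column to the pandas "category" dtype.
d = pd.DataFrame({"f": ["Apple", "Mango", "Banana", "Apple"]})
d["f"] = d["f"].astype("category")

# Step 2: copy the compact integer codes into a new column.
d["f_code"] = d["f"].cat.codes   # int8 codes: 0, 2, 1, 0

# Step 3: feed d["f_code"] (not d["f"]) to the model's Input/Embedding layer.
print(d[["f", "f_code"]])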

Below are some sample code lines copied out of my project doing exactly the steps above.

all_col_types_readcsv = {'userid':'int32','itemid':'int32','rating':'float32','user_age':'int32','gender':'category','job':'category','zipcode':'category'}

<some code omitted>

d = pd.read_csv(fn, sep='|', header=0, dtype=all_col_types_readcsv, encoding='utf-8', usecols=usecols_readcsv)

<some code omitted>

from pandas.api.types import is_string_dtype
# Select the columns to add code columns to. Numeric cols work fine with Embedding layer so ignore them.

cat_cols = [cn for cn in d.select_dtypes('category')]
print(cat_cols)
str_cols = [cn for cn in d.columns if is_string_dtype(d[cn])]
print(str_cols)
add_code_columns = [cn for cn in d.columns if (cn in cat_cols) and (cn in str_cols)]
print(add_code_columns)

<some code omitted>

# Actually add _code column for the selected columns
for cn in add_code_columns:
  codecolname = cn + "_code"
  if codecolname not in d.columns:
    d[codecolname] = d[cn].cat.codes

You can see the numeric codes pandas made for you:

d.info()
d.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99991 entries, 0 to 99990
Data columns (total 5 columns):
userid      99991 non-null int32
itemid      99991 non-null int32
rating      99991 non-null float32
job         99991 non-null category
job_code    99991 non-null int8
dtypes: category(1), float32(1), int32(2), int8(1)
memory usage: 1.3 MB

Finally, in this example you can omit the job column and keep the job_code column for passing into your Keras neural network model. Here is some of my model code:

# Slice this field's column out of the combined input tensor
# (field_num0_X_cols maps field name -> column index in input_x).
v = Lambda(lambda z: z[:, field_num0_X_cols[cn]], output_shape=(), name="Parser_" + cn)(input_x)
# Add an explicit length-1 axis so the Embedding layer sees shape (batch, 1).
emb_input = Lambda(lambda z: tf.expand_dims(z, axis=-1), output_shape=(1,), name="Expander_" + cn)(v)
# Embed the integer code; num_uniques[cn]+1 rows cover every code value.
a = Embedding(input_dim=num_uniques[cn]+1, output_dim=emb_len[cn], input_length=1, embeddings_regularizer=reg, name="E_" + cn)(emb_input)

By the way, please also wrap np.array() around all pandas DataFrames when passing them into model.fit(). It's not well documented, and apparently also not checked at runtime, that pandas DataFrames cannot be safely passed in; you get massive memory allocations otherwise, which crash hosts.
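For example (a sketch; the exact feature columns are illustrative, and model is the compiled Keras model built above):

import numpy as np

X = np.array(d[['userid', 'itemid', 'job_code']])  # features as an ndarray
y = np.array(d['rating'])                          # labels as an ndarray
model.fit(X, y, epochs=5, batch_size=256)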

Upvotes: 1

greeness

Reputation: 16104

I don't see a way to get what you want using feature columns (I don't see a function named sequence_embedding_column or similar among the available functions in tf.feature_column), because the result of a feature column seems to be a fixed-length tensor. Feature columns achieve that by using a combiner to aggregate the individual embedding vectors (sum, mean, sqrtn, etc.), so the dimension along the sequence of categories is lost.
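For illustration, a minimal sketch (assuming TF 2.x and the tf_ds and col_a_embedding objects built in the question): however the categories arrive, the feature-column path emits one fixed-length vector per example, and the raw table only surfaces as a layer variable.

import tensorflow as tf

# Wrap the embedding column in a DenseFeatures layer and run one batch.
layer = tf.keras.layers.DenseFeatures([col_a_embedding])
features, labels = next(iter(tf_ds))
print(layer(features).shape)            # (batch_size, 8): fixed length

# Once built, the layer's only variable is the raw (3, 8) embedding table.
print(layer.variables[0].numpy().shape) # (3, 8)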

But it's totally doable if you use lower-level APIs. First, you could construct a lookup table to convert categorical strings to ids.

features = tf.constant(["apple", "banana", "apple", "mango"])
table = tf.lookup.index_table_from_file(
    vocabulary_file="fruit.txt", num_oov_buckets=1)
ids = table.lookup(features)

# Content of "fruit.txt"
apple
mango
banana
unknown

Now you could initialize the embedding as a 2d variable. Its shape is [number of categories, embedding dimension]; note it must cover every id the lookup table can emit, i.e. the vocabulary file's entries plus the OOV buckets.

num_categories = 5  # 4 entries in fruit.txt + 1 OOV bucket
embedding_dim = 64
category_emb = tf.get_variable(
                "embedding_table", [num_categories, embedding_dim],
                initializer=tf.truncated_normal_initializer(stddev=0.02))

You could then look up the category embeddings like below:

ids_embeddings = tf.nn.embedding_lookup(category_emb, ids)

Note that ids_embeddings is a single stacked tensor, one embedding row per input id. Feel free to reshape it to the shape you want.
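Putting it together in TF 1.x graph mode (which the APIs above assume), you initialize both the table and the variable before fetching the numpy values:

import tensorflow as tf

with tf.Session() as sess:
    # Both the lookup table and the embedding variable need initializing.
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    vectors = sess.run(ids_embeddings)
    print(vectors.shape)  # (4, 64): one 64-d row per input string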

Upvotes: 1
