Reputation: 4470
I want to get the numpy vectors created using the "Embedding Column" in TensorFlow.
For example, creating a sample DF:
import pandas as pd

sample_column1 = ["Apple","Apple","Mango","Apple","Banana","Mango","Mango","Banana","Banana"]
sample_column2 = [1,2,1,3,4,6,2,1,3]
ds = pd.DataFrame(sample_column1, columns=["A"])
ds["B"] = sample_column2
ds
Converting the pandas DF to a TensorFlow dataset:
import tensorflow as tf

# A utility method to create a tf.data dataset from a pandas DataFrame
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    labels = dataframe.pop('B')
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    return ds
Creating an embedding column:
from tensorflow import feature_column

tf_ds = df_to_dataset(ds)

# embedding cols
col_a = feature_column.categorical_column_with_vocabulary_list(
    'A', ['Apple', 'Mango', 'Banana'])
col_a_embedding = feature_column.embedding_column(col_a, dimension=8)
Is there any way to get the embeddings as numpy vectors from the 'col_a_embedding' object?
For example, the category "Apple" will be embedded into a vector of size 8:
[a1 a2 a3 a4 a5 a6 a7 a8]
Can we fetch that vector?
Upvotes: 1
Views: 2445
Reputation: 1564
I suggest the easiest and fastest way is the following, which is what I am doing in my own app:
Use pandas read_csv to load your file, giving the string column the dtype "category" via the dtype parameter. Let's call that field "f". This is the original string column, not a numerical column yet.
Still in pandas, create a new column and copy the original column's cat.codes into it. Let's call that field "f_code". Pandas automatically encodes it as a compactly represented numerical column; it will have the numbers you need for passing to a neural network.
Now, in your Keras functional-API model, pass f_code to the model's Input layer. Its values are plain integers now (e.g. int8), so an Embedding layer can process them correctly. Don't pass the original string column to the model.
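Condensed, the three steps look roughly like this (a sketch; the file name data.csv, the field names, and the dimensions are placeholders, not from my project):
import pandas as pd
import tensorflow as tf

# Step 1: read the string column as a pandas "category" dtype.
d = pd.read_csv("data.csv", dtype={"f": "category"})

# Step 2: copy the category codes into a new numeric column.
d["f_code"] = d["f"].cat.codes

# Step 3: feed the code column into an Embedding layer via an Input layer.
num_uniques = d["f"].nunique()
inp = tf.keras.layers.Input(shape=(1,), name="f_code")
emb = tf.keras.layers.Embedding(input_dim=num_uniques + 1, output_dim=8)(inp)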
Below are some sample code lines copied out of my project doing exactly the steps above.
all_col_types_readcsv = {'userid':'int32','itemid':'int32','rating':'float32','user_age':'int32','gender':'category','job':'category','zipcode':'category'}
<some code omitted>
d = pd.read_csv(fn, sep='|', header=0, dtype=all_col_types_readcsv, encoding='utf-8', usecols=usecols_readcsv)
<some code omitted>
from pandas.api.types import is_string_dtype
# Select the columns to add code columns to. Numeric cols work fine with Embedding layer so ignore them.
cat_cols = [cn for cn in d.select_dtypes('category')]
print(cat_cols)
str_cols = [cn for cn in d.columns if is_string_dtype(d[cn])]
print(str_cols)
add_code_columns = [cn for cn in d.columns if (cn in cat_cols) and (cn in str_cols)]
print(add_code_columns)
<some code omitted>
# Actually add a _code column for each of the selected columns
for cn in add_code_columns:
    codecolname = cn + "_code"
    if codecolname not in d.columns:
        d[codecolname] = d[cn].cat.codes
You can see the numeric codes pandas made for you:
d.info()
d.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99991 entries, 0 to 99990
Data columns (total 5 columns):
userid 99991 non-null int32
itemid 99991 non-null int32
rating 99991 non-null float32
job 99991 non-null category
job_code 99991 non-null int8
dtypes: category(1), float32(1), int32(2), int8(1)
memory usage: 1.3 MB
Finally, in this example you can omit the job column and keep the job_code column for passing into your Keras neural network model. Here is some of my model code:
# Slice this field's column out of the combined input tensor
v = Lambda(lambda z: z[:, field_num0_X_cols[cn]], output_shape=(), name="Parser_" + cn)(input_x)
# Add a trailing axis so the Embedding layer sees shape (batch, 1)
emb_input = Lambda(lambda z: tf.expand_dims(z, axis=-1), output_shape=(1,), name="Expander_" + cn)(v)
a = Embedding(input_dim=num_uniques[cn]+1, output_dim=emb_len[cn], input_length=1, embeddings_regularizer=reg, name="E_" + cn)(emb_input)
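Incidentally, this also gives you the numpy vectors the original question asks about: after training, the learned embedding matrix can be read back out of the named layer, for example:
# Fetch the learned embedding matrix as a numpy array; row i is the
# vector for code i. Shape: [num_uniques[cn] + 1, emb_len[cn]].
emb_matrix = model.get_layer("E_" + cn).get_weights()[0]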
By the way, please also wrap np.array() around all pandas dataframes when passing them into model.fit(). It's not well documented, and apparently also not checked at runtime, that pandas dataframes cannot be safely passed in; otherwise you get massive memory allocations which crash hosts.
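For example (X_df, y_df, and the fit parameters are placeholders for whatever you pass):
# Convert to plain numpy arrays before fitting to avoid the memory blow-up.
model.fit(np.array(X_df), np.array(y_df), batch_size=256, epochs=10)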
Upvotes: 1
Reputation: 16104
I don't see a way to get what you want using feature columns (I don't see a function named sequence_embedding_column or similar among the available functions in tf.feature_column). The result from feature columns seems to be a fixed-length tensor: they achieve that by using a combiner to aggregate individual embedding vectors (sum, mean, sqrtn, etc.), so the dimension along the sequence of categories is actually lost.
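You can see this in the embedding_column signature itself; for example, reusing the question's column ('mean' is the default combiner, with 'sum' and 'sqrtn' as alternatives):
# However many category values a row produces, the combiner reduces their
# embedding vectors to a single fixed-length (here 8-dim) vector.
col_a = tf.feature_column.categorical_column_with_vocabulary_list(
    'A', ['Apple', 'Mango', 'Banana'])
col_a_embedding = tf.feature_column.embedding_column(
    col_a, dimension=8, combiner='mean')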
But it's totally doable if you use lower-level apis. First you could construct a lookup table to convert categorical strings to ids.
features = tf.constant(["apple", "banana", "apple", "mango"])
table = tf.lookup.index_table_from_file(
    vocabulary_file="fruit.txt", num_oov_buckets=1)
ids = table.lookup(features)

# Content of "fruit.txt"
apple
mango
banana
unknown
Now you could initialize the embedding as a 2D variable. Its shape is [number of categories, embedding dimension].
num_categories = 5  # 4 entries in fruit.txt + 1 OOV bucket, so ids run from 0 to 4
embedding_dim = 64
category_emb = tf.get_variable(
    "embedding_table", [num_categories, embedding_dim],
    initializer=tf.truncated_normal_initializer(stddev=0.02))
You could then look up the category embeddings like below:
ids_embeddings = tf.nn.embedding_lookup(category_emb, ids)
Note the result in ids_embeddings is one long concatenated tensor. Feel free to reshape it to the shape you want.
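To actually materialize those embeddings as numpy arrays in TF 1.x graph mode, evaluate them in a session (remember to initialize the lookup table as well):
# Evaluate the lookup + embedding graph; `vectors` is a numpy array of
# shape [4, 64], one row per input string in `features`.
with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    vectors = sess.run(ids_embeddings)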
Upvotes: 1