Reputation: 633
Suppose I have a data frame with several numerical variables and one categorical variable with 10000 categories. I use a neural network with Keras to get an embedding matrix for the categorical variable. The embedding size is 50, so the matrix Keras returns has dimension 10002 x 50.
One of the extra 2 rows is for unknown categories; the other I can't account for exactly - it's simply the only way Keras would work, i.e.,

```python
model_i = keras.layers.Embedding(input_dim=num_categories + 2, output_dim=embedding_size,
                                 input_length=1, name=f'embedding_{cat_feature}')(input_i)
```

without the +2 it would not work.
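For intuition, an embedding layer is essentially a lookup table: each integer index selects one row of the weight matrix, which is why every index fed to the layer must be strictly less than `input_dim`. A toy numpy sketch (with made-up sizes standing in for the 10000 categories and embedding size 50):

```python
import numpy as np

num_categories = 5   # toy stand-in for the 10000 real categories
embedding_size = 3   # toy stand-in for 50

# Weight matrix with 2 extra rows, mirroring input_dim=num_categories+2;
# the extra rows give indices num_categories and num_categories+1 somewhere
# to land (e.g. a reserved slot for unknown categories).
W = np.arange((num_categories + 2) * embedding_size,
              dtype=float).reshape(num_categories + 2, embedding_size)

codes = np.array([0, 4, 6])  # 6 is one of the two extra indices
vectors = W[codes]           # an embedding lookup is plain row indexing
print(vectors.shape)         # (3, 3)
```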
So, I have a training set with ~12M rows and a validation set with ~1M rows. The way I thought of reconstructing the embeddings was:

1. Add 50 NaN columns to the data frame.
2. For i in range(10002) (which is the number of categories + 2), look up key i in the reversed dictionary.
3. If it is in the dictionary, use pandas .loc to replace each row (in those 50 NaN columns) where the categorical variable equals the category name that i encodes with the corresponding row vector from the 10002 x 50 matrix.

The problem with this solution is that it's highly inefficient.
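On toy data, the per-category .loc loop described above looks roughly like this (a sketch with invented names - `rev_dict` maps integer codes back to category names):

```python
import numpy as np
import pandas as pd

# Toy data: 3 categories, embedding size 2
df = pd.DataFrame({'cat': ['a', 'b', 'a', 'c']})
emb = np.array([[1., 2.], [3., 4.], [5., 6.]])  # rows indexed by code
rev_dict = {0: 'a', 1: 'b', 2: 'c'}             # code -> category name

emb_cols = [f'emb_{j}' for j in range(emb.shape[1])]
for c in emb_cols:
    df[c] = np.nan                              # the NaN columns
for i in range(emb.shape[0]):
    if i in rev_dict:                           # skip codes with no category
        mask = df['cat'] == rev_dict[i]
        df.loc[mask, emb_cols] = emb[i]         # one .loc write per category

print(df)
```

Each pass scans the whole frame, so the work grows with (number of categories) x (number of rows) - the source of the inefficiency.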
A friend told me about another solution: convert the categorical variable to a one-hot sparse matrix with dimensions 12M x 10000 (for the training set), then multiply it by the embedding matrix, which should have dimensions 10000 x 50, thus getting a 12M x 50 matrix which I can then concatenate to my original data frame. The problem here is that the matrix Keras gives me has 10002 (num_categories + 2) rows instead of 10000, so again the dimensions do not match.

Does anyone know a better way of doing this, or can address the problem in this second approach?
My ultimate goal is to have a data frame with all my variables minus the categorical variable, and instead another 50 columns holding the row vectors that represent the embeddings for that categorical variable.
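For reference, the one-hot trick works because multiplying a sparse one-hot matrix by the embedding matrix is exactly equivalent to row lookup. A scipy sketch on invented toy sizes:

```python
import numpy as np
from scipy import sparse

n_rows, n_cats = 6, 3
codes = np.array([0, 2, 1, 0, 2, 2])            # category code per row
emb = np.array([[1., 2.], [3., 4.], [5., 6.]])  # n_cats x embedding_size

# Sparse one-hot: a single 1 per row, in the column given by the code
onehot = sparse.csr_matrix(
    (np.ones(n_rows), (np.arange(n_rows), codes)), shape=(n_rows, n_cats))

dense_result = onehot.dot(emb)                  # n_rows x embedding_size
assert np.array_equal(dense_result, emb[codes]) # same as direct lookup
print(dense_result.shape)                       # (6, 2)
```

Because the one-hot matrix is sparse, the product never materializes the 12M x 10000 dense matrix.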
Upvotes: 1
Views: 1306
Reputation: 633
So eventually I found a solution for the second method mentioned in my post. Using sparse matrices avoids the memory issues that would otherwise occur when multiplying matrices over large data (many categories and/or observations).
I wrote this function, which returns the original data frame with the embedded vectors of all the desired categorical variables appended.
```python
from typing import Dict, List

import keras
import numpy as np
import pandas as pd
import tqdm
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder


def get_embeddings(model: keras.models.Model, cat_vars: List[str], df: pd.DataFrame,
                   cat_dicts: Dict[str, Dict[str, int]]) -> pd.DataFrame:
    df_list: List[pd.DataFrame] = [df]
    for var_name in cat_vars:
        var_column: pd.Series = df.loc[:, var_name]
        # One-hot encode the variable as a sparse matrix to avoid memory blow-up
        enc = OneHotEncoder()
        sparse_mat = enc.fit_transform(var_column.values.reshape(-1, 1))
        sparse_mat = sparse.csr_matrix(sparse_mat, dtype='uint8')
        orig_dict = cat_dicts[var_name]
        # Embedding weight matrix for this variable, fetched once per variable
        emb = model.get_layer(f'embedding_{var_name}').get_weights()[0]
        # Row i of match_to_arr will hold the embedding of the i-th one-hot column
        match_to_arr = np.empty((sparse_mat.shape[1], emb.shape[1]))
        match_to_arr[:] = np.nan
        unknown_cat = emb.shape[0] - 1  # last row is reserved for unknown categories
        for i, col in enumerate(tqdm.tqdm(enc.categories_[0])):
            if col in orig_dict:
                match_to_arr[i, :] = emb[orig_dict[col], :]
            else:
                match_to_arr[i, :] = emb[unknown_cat, :]
        # Sparse one-hot times lookup table = per-row embedding vectors
        a = sparse_mat.dot(match_to_arr)
        a = pd.DataFrame(a, columns=[f'{var_name}_{i}'
                                     for i in range(1, match_to_arr.shape[1] + 1)])
        df_list.append(a)
    df_final = pd.concat(df_list, axis=1)
    return df_final
```

cat_dicts (renamed from dict so it doesn't shadow the builtin) is a dictionary of dictionaries: one per categorical variable, which I encoded beforehand, with the category names as keys and integers as values. Note that each category was encoded with num_values + 1 values, the last being reserved for unknown categories.
Basically, for each category value I ask whether it is in the dictionary. If it is, I assign to the corresponding row of a temporary array (so the first row for the first category, and so on) the row of the embedding matrix whose index is the integer that category name was encoded to.
If it is not in the dictionary, I assign to that (i-th) row the last row of the embedding matrix, which corresponds to unknown categories.
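Stripped to its core, that alignment step can be sketched as follows: build a lookup table whose i-th row is the embedding of the i-th one-hot column, falling back to the reserved unknown row when a category is missing from the training-time dictionary (numpy sketch with invented data):

```python
import numpy as np

# Embedding matrix: 3 known categories + 1 reserved "unknown" row
emb = np.array([[1., 1.], [2., 2.], [3., 3.], [9., 9.]])
unknown_row = emb.shape[0] - 1             # last row = unknown categories
orig_dict = {'a': 0, 'b': 1, 'c': 2}       # category name -> integer code

# Categories in the order the one-hot encoder produced its columns;
# 'z' was never seen at training time
encoder_categories = ['a', 'z', 'c']

match_to_arr = np.empty((len(encoder_categories), emb.shape[1]))
for i, cat in enumerate(encoder_categories):
    code = orig_dict.get(cat, unknown_row)  # fall back to the unknown row
    match_to_arr[i] = emb[code]

print(match_to_arr)  # rows: emb[0], emb[3] (unknown), emb[2]
```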
Upvotes: 2
Reputation: 22031
This is what I suggested in the comments:

```python
import numpy as np
import pandas as pd
from keras.layers import Dense, Embedding, Flatten, Input
from keras.models import Model

df = pd.DataFrame({'int': np.random.uniform(0, 1, 10),
                   'cat': np.random.randint(0, 333, 10)})  # cat is already integer-encoded

## define the embedding model; you can also use multiple input sources
inp = Input((1,))
emb = Embedding(input_dim=10000 + 2, output_dim=50, name='embedding')(inp)
out = Dense(10)(emb)
model = Model(inp, out)
# model.compile(...)
# model.fit(...)

## extract the cat embeddings
extractor = Model(model.input, Flatten()(model.get_layer('embedding').output))

## concat the embeddings to the original df
df = pd.concat([df, pd.DataFrame(extractor.predict(df.cat.values))], axis=1)
df
```
Upvotes: 0