Reputation: 633
Suppose I have a data frame with several numerical variables and one categorical variable with 10000 categories. I use a neural network with Keras to get an embedding matrix for the categorical variable. The embedding size is 50, so the matrix Keras returns has dimension 10002 x 50.
One of the extra 2 rows is for unknown categories; the other I can't account for exactly - it's simply the only way Keras would work, i.e.,

```python
model_i = keras.layers.Embedding(input_dim=num_categories + 2, output_dim=embedding_size,
                                 input_length=1, name=f'embedding_{cat_feature}')(input_i)
```

without the +2 it would not work.
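For intuition, an embedding layer is essentially a lookup table: each integer index selects one row of the weight matrix, which is why every index fed to the layer must be strictly less than `input_dim`. A toy numpy sketch (with made-up sizes standing in for the 10000 categories and embedding size 50):

```python
import numpy as np

num_categories = 5   # toy stand-in for the 10000 real categories
embedding_size = 3   # toy stand-in for 50

# Weight matrix with 2 extra rows, mirroring input_dim=num_categories+2;
# the extra rows give indices num_categories and num_categories+1 somewhere
# to land (e.g. a reserved slot for unknown categories).
W = np.arange((num_categories + 2) * embedding_size,
              dtype=float).reshape(num_categories + 2, embedding_size)

codes = np.array([0, 4, 6])  # 6 is one of the two extra indices
vectors = W[codes]           # an embedding lookup is plain row indexing
print(vectors.shape)         # (3, 3)
```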
So, I have a training set with ~12M rows and a validation set with ~1M rows. The way I thought of reconstructing the embeddings was:

1. Add 50 NaN columns to the data frame.
2. For i in range(10002) (which is the number of categories + 2), look up key i in the reversed dictionary.
3. If it is in the dictionary, use pandas .loc to replace each row (in those 50 NaN columns) where the categorical variable equals the category name that i encodes with the corresponding row vector from the 10002 x 50 matrix.

The problem with this solution is that it's highly inefficient.
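On toy data, the per-category .loc loop described above looks roughly like this (a sketch with invented names - `rev_dict` maps integer codes back to category names):

```python
import numpy as np
import pandas as pd

# Toy data: 3 categories, embedding size 2
df = pd.DataFrame({'cat': ['a', 'b', 'a', 'c']})
emb = np.array([[1., 2.], [3., 4.], [5., 6.]])  # rows indexed by code
rev_dict = {0: 'a', 1: 'b', 2: 'c'}             # code -> category name

emb_cols = [f'emb_{j}' for j in range(emb.shape[1])]
for c in emb_cols:
    df[c] = np.nan                              # the NaN columns
for i in range(emb.shape[0]):
    if i in rev_dict:                           # skip codes with no category
        mask = df['cat'] == rev_dict[i]
        df.loc[mask, emb_cols] = emb[i]         # one .loc write per category

print(df)
```

Each pass scans the whole frame, so the work grows with (number of categories) x (number of rows) - the source of the inefficiency.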
A friend told me about another solution: convert the categorical variable to a one-hot sparse matrix with dimensions 12M x 10000 (for the training set), then multiply it by the embedding matrix, which should have dimensions 10000 x 50, thus getting a 12M x 50 matrix which I can then concatenate to my original data frame. The problem here is that the matrix Keras gives me has 10002 (num_categories + 2) rows instead of 10000, so again the dimensions do not match.

Does anyone know a better way of doing this, or can address the problem in this second approach?
My ultimate goal is to have a data frame with all my variables minus the categorical variable, and instead another 50 columns holding the row vectors that represent the embeddings for that categorical variable.
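For reference, the one-hot trick works because multiplying a sparse one-hot matrix by the embedding matrix is exactly equivalent to row lookup. A scipy sketch on invented toy sizes:

```python
import numpy as np
from scipy import sparse

n_rows, n_cats = 6, 3
codes = np.array([0, 2, 1, 0, 2, 2])            # category code per row
emb = np.array([[1., 2.], [3., 4.], [5., 6.]])  # n_cats x embedding_size

# Sparse one-hot: a single 1 per row, in the column given by the code
onehot = sparse.csr_matrix(
    (np.ones(n_rows), (np.arange(n_rows), codes)), shape=(n_rows, n_cats))

dense_result = onehot.dot(emb)                  # n_rows x embedding_size
assert np.array_equal(dense_result, emb[codes]) # same as direct lookup
print(dense_result.shape)                       # (6, 2)
```

Because the one-hot matrix is sparse, the product never materializes the 12M x 10000 dense matrix.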
Upvotes: 1
Views: 1306
Reputation: 633
So eventually I found a solution for the second method mentioned in my post. Using sparse matrices avoids the memory issues that would otherwise occur when multiplying matrices over large data (many categories and/or observations).
I wrote this function, which returns the original data frame with the embedded vectors of all the desired categorical variables appended.
```python
from typing import Dict, List

import keras
import numpy as np
import pandas as pd
import tqdm
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder


def get_embeddings(model: keras.models.Model, cat_vars: List[str], df: pd.DataFrame,
                   cat_dicts: Dict[str, Dict[str, int]]) -> pd.DataFrame:
    df_list: List[pd.DataFrame] = [df]
    for var_name in cat_vars:
        var_column: pd.Series = df.loc[:, var_name]
        # One-hot encode the variable as a sparse matrix to avoid memory blow-up
        enc = OneHotEncoder()
        sparse_mat = enc.fit_transform(var_column.values.reshape(-1, 1))
        sparse_mat = sparse.csr_matrix(sparse_mat, dtype='uint8')
        orig_dict = cat_dicts[var_name]
        # Embedding weight matrix for this variable, fetched once per variable
        emb = model.get_layer(f'embedding_{var_name}').get_weights()[0]
        # Row i of match_to_arr will hold the embedding of the i-th one-hot column
        match_to_arr = np.empty((sparse_mat.shape[1], emb.shape[1]))
        match_to_arr[:] = np.nan
        unknown_cat = emb.shape[0] - 1  # last row is reserved for unknown categories
        for i, col in enumerate(tqdm.tqdm(enc.categories_[0])):
            if col in orig_dict:
                match_to_arr[i, :] = emb[orig_dict[col], :]
            else:
                match_to_arr[i, :] = emb[unknown_cat, :]
        # Sparse one-hot times lookup table = per-row embedding vectors
        a = sparse_mat.dot(match_to_arr)
        a = pd.DataFrame(a, columns=[f'{var_name}_{i}'
                                     for i in range(1, match_to_arr.shape[1] + 1)])
        df_list.append(a)
    df_final = pd.concat(df_list, axis=1)
    return df_final
```

cat_dicts (renamed from dict so it doesn't shadow the builtin) is a dictionary of dictionaries: one per categorical variable, which I encoded beforehand, with the category names as keys and integers as values. Note that each category was encoded with num_values + 1 values, the last being reserved for unknown categories.
Basically, for each category value I ask whether it is in the dictionary. If it is, I assign to the corresponding row of a temporary array (so the first row for the first category, and so on) the row of the embedding matrix whose index is the integer that category name was encoded to.
If it is not in the dictionary, I assign to that (i-th) row the last row of the embedding matrix, which corresponds to unknown categories.
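Stripped to its core, that alignment step can be sketched as follows: build a lookup table whose i-th row is the embedding of the i-th one-hot column, falling back to the reserved unknown row when a category is missing from the training-time dictionary (numpy sketch with invented data):

```python
import numpy as np

# Embedding matrix: 3 known categories + 1 reserved "unknown" row
emb = np.array([[1., 1.], [2., 2.], [3., 3.], [9., 9.]])
unknown_row = emb.shape[0] - 1             # last row = unknown categories
orig_dict = {'a': 0, 'b': 1, 'c': 2}       # category name -> integer code

# Categories in the order the one-hot encoder produced its columns;
# 'z' was never seen at training time
encoder_categories = ['a', 'z', 'c']

match_to_arr = np.empty((len(encoder_categories), emb.shape[1]))
for i, cat in enumerate(encoder_categories):
    code = orig_dict.get(cat, unknown_row)  # fall back to the unknown row
    match_to_arr[i] = emb[code]

print(match_to_arr)  # rows: emb[0], emb[3] (unknown), emb[2]
```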
Upvotes: 2
Reputation: 22031
This is what I suggested in the comments:

```python
import numpy as np
import pandas as pd
from keras.layers import Dense, Embedding, Flatten, Input
from keras.models import Model

df = pd.DataFrame({'int': np.random.uniform(0, 1, 10),
                   'cat': np.random.randint(0, 333, 10)})  # cat is already integer-encoded

## define the embedding model; you can also use multiple input sources
inp = Input((1,))
emb = Embedding(input_dim=10000 + 2, output_dim=50, name='embedding')(inp)
out = Dense(10)(emb)
model = Model(inp, out)
# model.compile(...)
# model.fit(...)

## extract the cat embeddings
extractor = Model(model.input, Flatten()(model.get_layer('embedding').output))

## concat the embeddings to the original df
df = pd.concat([df, pd.DataFrame(extractor.predict(df.cat.values))], axis=1)
df
```
Upvotes: 0