How to do Multi-hot Encoding but with actual values instead of ones

Question

I can multi-hot encode the movies each user has rated with:

import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer


def multihot_encode(actual_values, ordered_possible_values) -> np.ndarray:
    """ Converts a categorical feature with multiple values to a multi-label binary encoding """
    mlb = MultiLabelBinarizer(classes=ordered_possible_values)
    binary_format = mlb.fit_transform(actual_values)
    return binary_format

user_matrix = multihot_encode(lists_of_movieIds, all_movieIds)

where lists_of_movieIds is a list of length batch_size containing variable-length lists of movie IDs (strings), and all_movieIds contains all possible movie ID strings.

However, instead of just a 1 in the resulting matrix, I want the actual rating that the user gave to the movie. Alongside lists_of_movieIds I also have a parallel list_of_ratings, with one rating per movie ID in the same order.

How do I go about doing that efficiently? Is there another MultiLabelBinarizer which takes those as args? Can I do some fancy linear algebra to get there?

I tried to do it like:

user_matrix[user_matrix == 1] = np.concatenate(list_of_ratings)

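but the ratings end up in the wrong cells: the boolean mask is filled in row-major order, i.e. in all_movieIds column order within each row, while each user's ratings follow the order of that user's movie list, not the order of all_movieIds.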
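
For reference, one possible way around the ordering problem is to skip the binarizer and write each rating directly at its (row, column) position. The following is only a minimal sketch: the function name multihot_encode_with_values and the toy data are hypothetical, and it assumes every movie ID appears in all_movieIds and that a rating of 0 can stand for "not rated".

```python
import numpy as np


def multihot_encode_with_values(lists_of_ids, lists_of_values, ordered_possible_ids):
    """Multi-hot style encoding that stores each value at its ID's column."""
    # Map every possible ID to its column index (hypothetical helper mapping).
    col_of = {item_id: j for j, item_id in enumerate(ordered_possible_ids)}

    lengths = [len(ids) for ids in lists_of_ids]
    total = sum(lengths)

    # One row index per (user, movie) pair: user i is repeated lengths[i] times.
    rows = np.repeat(np.arange(len(lists_of_ids)), lengths)
    # Column index of each movie, in the same flattened order as the ratings.
    cols = np.fromiter(
        (col_of[m] for ids in lists_of_ids for m in ids),
        dtype=np.intp,
        count=total,
    )
    vals = np.fromiter(
        (v for values in lists_of_values for v in values),
        dtype=float,
        count=total,
    )

    out = np.zeros((len(lists_of_ids), len(ordered_possible_ids)), dtype=float)
    out[rows, cols] = vals  # place every rating at its own (row, column)
    return out


# Toy data, made up for illustration only.
all_movieIds = ["m1", "m2", "m3", "m4"]
lists_of_movieIds = [["m3", "m1"], ["m2"]]
list_of_ratings = [[4.5, 2.0], [5.0]]

user_matrix = multihot_encode_with_values(lists_of_movieIds, list_of_ratings, all_movieIds)
print(user_matrix)
```

Because rows, cols and vals are all built by flattening the lists in the same order, each rating stays paired with its movie regardless of how all_movieIds is ordered. An ID missing from all_movieIds raises a KeyError in this sketch.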

