Michael

Reputation: 371

How to do Multi-hot Encoding but with actual values instead of ones

I can multi-hot encode the movies each user has rated with:

import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer


def multihot_encode(actual_values, ordered_possible_values) -> np.ndarray:
    """ Converts a categorical feature with multiple values to a multi-label binary encoding """
    # Passing classes fixes the column order to ordered_possible_values
    mlb = MultiLabelBinarizer(classes=ordered_possible_values)
    binary_format = mlb.fit_transform(actual_values)
    return binary_format

user_matrix = multihot_encode(lists_of_movieIds, all_movieIds)

where lists_of_movieIds is a list of length batch_size containing variable-length lists of movie IDs (strings), and all_movieIds is the list of all possible movie ID strings.

However, instead of just a 1 in the resulting matrix, I want the actual rating that the user gave to the movie. Alongside lists_of_movieIds I also have a "parallel" list_of_ratings with the same structure, where each rating lines up with the movie ID at the same position.

How do I go about doing that efficiently? Is there another MultiLabelBinarizer which takes those as args? Can I do some fancy linear algebra to get there?

I tried to do it like:

user_matrix[user_matrix == 1] = np.concatenate(list_of_ratings)

but the ratings are misplaced because list_of_ratings is not ordered the same way as all_movieIds...
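
To make the misplacement concrete, here is a minimal sketch with made-up data (the IDs "m1" to "m3" and the helper dict `col` are assumptions for illustration, and the binary matrix is built by hand so the snippet needs only NumPy). Boolean-mask assignment fills positions in row-major order, so it only lines up if each user's ratings are first sorted by column index:

```python
import numpy as np

# Hypothetical data: m3 -> 4, m1 -> 3 for user 0; m2 -> 5 for user 1
all_movieIds = ["m1", "m2", "m3"]
lists_of_movieIds = [["m3", "m1"], ["m2"]]
list_of_ratings = [[4, 3], [5]]

col = {m: j for j, m in enumerate(all_movieIds)}  # movie ID -> column index

# Binary matrix equivalent to the multi-hot encoding of this data
user_matrix = np.zeros((len(lists_of_movieIds), len(all_movieIds)))
for i, ids in enumerate(lists_of_movieIds):
    user_matrix[i, [col[m] for m in ids]] = 1

# Naive attempt: values are written in column order within each row,
# so m1 receives 4 and m3 receives 3 (swapped)
naive = user_matrix.copy()
naive[naive == 1] = np.concatenate(list_of_ratings)

# Reordering each user's ratings by column index makes the mask line up
# (assumes no movie appears twice for the same user)
sorted_ratings = [
    [r for _, r in sorted(zip((col[m] for m in ids), rs))]
    for ids, rs in zip(lists_of_movieIds, list_of_ratings)
]
fixed = user_matrix.copy()
fixed[fixed == 1] = np.concatenate(sorted_ratings)

print(naive)
print(fixed)
```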

Upvotes: 0

Views: 2376

Answers (1)

mujjiga

Reputation: 16916

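You can do this without MultiLabelBinarizer by mapping each ID to its column index and writing the ratings in directly. Note that each user's IDs must be an ordered list (not a set), otherwise they cannot be paired reliably with the ratings:

import numpy as np

classes = ['comedy', 'xyz', 'thriller', 'sci-fi']
# Map each class (movie ID) to its column index
id_dict = {c: i for i, c in enumerate(classes)}
# Lists, not sets: set iteration order is not guaranteed to match the ratings
lists_of_movieIds = [['sci-fi', 'thriller'], ['comedy']]
list_of_ratings = [[4, 3], [5]]

data = np.zeros((len(lists_of_movieIds), len(classes)))
for i, (m_ids, rs) in enumerate(zip(lists_of_movieIds, list_of_ratings)):
    for m_id, r in zip(m_ids, rs):
        # Row i is the user, the column is the movie's position in classes
        data[i, id_dict[m_id]] = r

print(data)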

Output:

[[0. 0. 3. 4.]
 [5. 0. 0. 0.]]
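
If the Python loops become a bottleneck, the same fill can probably be done with a single NumPy fancy-indexing assignment. This is only a sketch, assuming each user's IDs are kept in an ordered list aligned with the ratings and that no ID repeats within a user; the names `rows`, `cols`, and `vals` are introduced here for illustration:

```python
import numpy as np

classes = ['comedy', 'xyz', 'thriller', 'sci-fi']
lists_of_movieIds = [['sci-fi', 'thriller'], ['comedy']]
list_of_ratings = [[4, 3], [5]]

id_dict = {c: i for i, c in enumerate(classes)}

# One (row, col, value) triplet per rating
lengths = [len(ids) for ids in lists_of_movieIds]
rows = np.repeat(np.arange(len(lists_of_movieIds)), lengths)  # user index per rating
cols = np.array([id_dict[m] for ids in lists_of_movieIds for m in ids])
vals = np.concatenate(list_of_ratings)

data = np.zeros((len(lists_of_movieIds), len(classes)))
data[rows, cols] = vals  # scatter all ratings in one assignment

print(data)
```

With many movies and few ratings per user, the same triplets could instead be passed to scipy.sparse.coo_matrix((vals, (rows, cols)), shape=data.shape) to avoid allocating the dense matrix.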

Upvotes: 1
