Reputation: 371
I can multi-hot encode the movies each user has rated with:
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

def multihot_encode(actual_values, ordered_possible_values) -> np.ndarray:
    """Converts a categorical feature with multiple values to a multi-label binary encoding."""
    # Fixing `classes` keeps the column order equal to ordered_possible_values.
    mlb = MultiLabelBinarizer(classes=ordered_possible_values)
    binary_format = mlb.fit_transform(actual_values)
    return binary_format

user_matrix = multihot_encode(lists_of_movieIds, all_movieIds)
where lists_of_movieIds is a batch_size-sized list of variable-length lists of movie IDs (strings) and all_movieIds contains all the possible movie ID strings.
However, instead of just a 1 in the resulting matrix, I want the actual rating that the user gave to the movie. Alongside lists_of_movieIds I also have a "parallel" list_of_ratings, which holds, for each user, the ratings in the same order as that user's movie IDs.
How do I do that efficiently? Is there another variant of MultiLabelBinarizer that takes the ratings as arguments? Can I do some fancy linear algebra to get there?
I tried to do it like this:
    user_matrix[user_matrix == 1] = np.concatenate(list_of_ratings)
but the ratings are misplaced because list_of_ratings is not ordered the same way as all_movieIds...
Upvotes: 0
Views: 2376
Reputation: 16916
Without using MultiLabelBinarizer
import numpy as np

classes = ['comedy', 'xyz', 'thriller', 'sci-fi']
# Map each class to its column index.
id_dict = {c: i for i, c in enumerate(classes)}

# Lists (not sets) so each ID stays aligned with its rating; set order is arbitrary.
lists_of_movieIds = [['sci-fi', 'thriller'], ['comedy']]
list_of_ratings = [[4, 3], [5]]

data = np.zeros((len(lists_of_movieIds), len(classes)))
for i, (m_ids, rs) in enumerate(zip(lists_of_movieIds, list_of_ratings)):
    for m_id, r in zip(m_ids, rs):
        data[i, id_dict[m_id]] = r

print(data)
Output:
[[0. 0. 3. 4.]
 [5. 0. 0. 0.]]
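If the nested Python loop becomes a bottleneck on large batches, the same idea can probably be written with a single NumPy fancy-indexing assignment. The following is only a sketch under the assumption that each user's IDs and ratings are ordered lists of equal length; the function name ratings_matrix and its parameter names are made up for illustration and are not part of scikit-learn or NumPy.

```python
import numpy as np

def ratings_matrix(lists_of_movie_ids, lists_of_ratings, all_movie_ids):
    """Return an (n_users, n_movies) array of ratings, with 0 for unrated movies."""
    # Column index of every possible movie ID.
    col_of = {movie_id: j for j, movie_id in enumerate(all_movie_ids)}
    lengths = np.array([len(ids) for ids in lists_of_movie_ids], dtype=np.intp)
    total = int(lengths.sum())

    # Row index of each (user, movie) pair: user i repeated once per rated movie.
    rows = np.repeat(np.arange(len(lists_of_movie_ids)), lengths)
    # Column index and rating of each pair, flattened in the same order as rows.
    cols = np.fromiter(
        (col_of[m] for ids in lists_of_movie_ids for m in ids),
        dtype=np.intp, count=total,
    )
    values = np.fromiter(
        (r for rs in lists_of_ratings for r in rs),
        dtype=float, count=total,
    )

    matrix = np.zeros((len(lists_of_movie_ids), len(all_movie_ids)))
    matrix[rows, cols] = values  # one vectorized assignment instead of nested loops
    return matrix

all_ids = ['comedy', 'xyz', 'thriller', 'sci-fi']
result = ratings_matrix([['sci-fi', 'thriller'], ['comedy']], [[4, 3], [5]], all_ids)
print(result)
```

The dictionary lookup is still a Python-level loop, so the gain comes mainly from the single assignment. If the movie catalogue is large and most entries are zero, the same rows, cols and values triples could instead be passed to scipy.sparse.csr_matrix((values, (rows, cols)), shape=...) to avoid allocating the dense array.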
Upvotes: 1