Reputation: 73
I have a .tsv file from which I've created a Python dictionary where the keys are the movie IDs and the values are the features (every movie has a different number of features).
From this dictionary I want to create an item-feature sparse matrix to use for a recommender system project. At the end I would like to have a binary sparse matrix with a 1 wherever a movie has a certain feature. Something like this:
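For illustration (the movie and feature IDs below are made up), the dictionary and the matrix I'm after would look roughly like this:

# Toy illustration, made-up IDs: movie_id -> list of feature ids
movie_features_dict = {
    0: [0, 2],
    1: [1],
    2: [0, 1, 2],
}

# Desired binary item-feature matrix (rows = movies, columns = features):
# [[1, 0, 1],
#  [0, 1, 0],
#  [1, 1, 1]]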
This is the code I've written:

import numpy as np
import scipy.sparse as sp

def Dictionary():
    # Read the mapping file: first column is the movie id, the rest are its feature ids
    d = {}
    with open(filepath_mapping) as f:
        for line in f.readlines():
            line = line.split()
            key = int(line[0])
            value = [int(el) for el in line[1:]]
            d[key] = value
    return d

movie_features_dict = Dictionary()

n = len(movie_features_dict)
value_lengths = [len(v) for v in movie_features_dict.values()]
d = max(value_lengths)
print(f"ITEM*FEATURES matrix shape: {n,d}\n")

item_feature_matrix = sp.dok_matrix((n, d), dtype=np.int8)
for movie_id, features in movie_features_dict.items():
    item_feature_matrix[movie_id, features] = 1
item_feature_matrix = item_feature_matrix.tocsr()
print(item_feature_matrix.shape)
I have 22069 movies and the movie with the maximum number of features should have 885 features, so in theory I should get a 22069*885 matrix, but the code I've written keeps raising this error:
raise IndexError('index (%d) out of range' % max_indx)
IndexError: index (614734) out of range
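To narrow it down, a toy snippet with made-up sizes triggers the same IndexError whenever a feature value is larger than the number of columns of the dok_matrix:

import numpy as np
import scipy.sparse as sp

toy = sp.dok_matrix((3, 5), dtype=np.int8)  # 3 movies, 5 feature columns
toy[0, [1, 3]] = 1   # fine: both column indices are < 5
toy[1, [2, 7]] = 1   # raises the same IndexError: index (7) out of range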
Upvotes: 1
Views: 460
Reputation: 73
I'm writing this answer for future users who run into a similar problem.
As I said in the comments to the other answers above, creating a new pandas DataFrame was not useful for my needs, so this is the solution I've implemented.
Based on this answer I've created the sparse matrix in this way:
from sklearn.feature_extraction import DictVectorizer

# Restructure the dictionary into a list of {feature_id: 1} dicts, one per movie
restructured = []
for key in movie_features_dict:
    data_dict = {}
    for feat in movie_features_dict[key]:
        data_dict[feat] = 1
    restructured.append(data_dict)

dictvectorizer = DictVectorizer(sparse=True)
matrix_item_features = dictvectorizer.fit_transform(restructured)
print(f"Item-feature matrix shape: {matrix_item_features.shape}")
You can take a look here and here to get a better understanding of how DictVectorizer works.
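For example, on a toy input (feature IDs made up for illustration), DictVectorizer creates one column per distinct feature ID and fills it with the dictionary values:

from sklearn.feature_extraction import DictVectorizer

# Two toy movies, each as a {feature_id: 1} dict
restructured = [{18: 1, 23: 1}, {23: 1, 854: 1}]

vec = DictVectorizer(sparse=True)
m = vec.fit_transform(restructured)

print(vec.feature_names_)   # [18, 23, 854] -> one column per distinct feature id
print(m.toarray())          # [[1. 1. 0.]
                            #  [0. 1. 1.]]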
Upvotes: 2
Reputation: 2532
Based on this answer, you can do the following with a few lines of code:
import pandas as pd

id_to_features = {
    880: [18, 23, 854, 98475, 20],
    152: [1, 578, 18, 654, 23, 5, 11],
    6654: [2088]
}

df = pd.DataFrame({"features": list(id_to_features.values())})
matrix = df['features'].apply(pd.value_counts).fillna(0).astype(int)
ids = list(id_to_features.keys())
matrix.index = ids
matrix = matrix.reindex(sorted(matrix.columns), axis=1)
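If you then need a scipy sparse matrix rather than a dense DataFrame (as asked in the question), a possible conversion, assuming scipy is available, is:

import scipy.sparse as sp

# Convert the dense item-feature DataFrame into a CSR sparse matrix
sparse_matrix = sp.csr_matrix(matrix.values)
print(sparse_matrix.shape)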
EDIT
Out of curiosity, I have created a fake dataset and the code above took 7 seconds to run (using perf_counter) on a common laptop.
Here is the code for generating the dataset:
from random import randint

id_to_features = {
    i: [randint(1, 886) for _ in range(randint(1, 10))] for i in range(1, 22070)
}
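If you want to reproduce the timing, a minimal sketch with perf_counter around the construction shown above (reusing the fake id_to_features) could be:

from time import perf_counter

import pandas as pd

start = perf_counter()
df = pd.DataFrame({"features": list(id_to_features.values())})
matrix = df['features'].apply(pd.value_counts).fillna(0).astype(int)
matrix.index = list(id_to_features.keys())
matrix = matrix.reindex(sorted(matrix.columns), axis=1)
print(f"Elapsed: {perf_counter() - start:.2f} s")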
The resulting matrix requires 78 MB of space, computed using matrix.memory_usage(index=True, deep=True).sum(); using astype("int8") instead, it requires 20 MB.
Upvotes: 1