Reputation: 1
I'm trying to calculate the cosine similarity of a sparse matrix:
<63671x30 sparse matrix of type '<class 'numpy.uint8'>'
with 131941 stored elements in Compressed Sparse Row format>
The thing is, I used scikit-learn's cosine_similarity
function, but I got this error:
MemoryError: Unable to allocate 29.7 GiB for an array with shape (3984375099,) and data type float64
I googled the error and was advised to increase the size of the paging file, but after doing that my PC just freezes and I have to force a shutdown and reboot. Is there any way to overcome this?
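For reference, this is roughly what the failing call looks like; the matrix below is just a random stand-in with the same shape and density as my real data:

from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Random stand-in with the same shape/density as my real matrix (illustrative only)
X = sparse.random(63671, 30, density=0.07, format='csr')

# The dense 63671 x 63671 float64 output alone is on the order of 30 GB,
# which is where the MemoryError comes from
similarities = cosine_similarity(X)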
Upvotes: 0
Views: 1154
Reputation: 892
Inspiration from: Link
Try computing the cosine similarity in a chunk-wise manner, i.e. take n
rows at a time and calculate their cosine similarity against the whole matrix.
from scipy import sparse
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cosine_similarity_n_space(m1, m2, batch_size=100):
    assert m1.shape[1] == m2.shape[1] and isinstance(batch_size, int)
    # Pre-allocate the full result, then fill it one batch of rows at a time.
    ret = np.empty((m1.shape[0], m2.shape[0]))
    batches = m1.shape[0] // batch_size
    if m1.shape[0] % batch_size != 0:
        batches += 1
    for row_i in range(batches):
        start = row_i * batch_size
        end = min((row_i + 1) * batch_size, m1.shape[0])
        rows = m1[start:end]
        # Only a (batch_size x m2.shape[0]) block is materialised per iteration.
        sim = cosine_similarity(rows, m2)
        ret[start:end] = sim
    return ret

# Sanity check: the chunked result matches sklearn's one-shot result.
A = np.array([[0, 1, 0, 0, 1], [0, 0, 1, 1, 1], [1, 1, 0, 1, 0]])
A_sparse = sparse.csr_matrix(A)
similarities = cosine_similarity(A_sparse)
chunk_wise_similarity = cosine_similarity_n_space(A_sparse, A_sparse)
comparison = similarities == chunk_wise_similarity
equal_arrays = comparison.all()
print(equal_arrays)
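One caveat for a matrix as tall as the one in the question: even the chunked version above pre-allocates a dense 63671 x 63671 float64 result, which is itself on the order of 30 GB of RAM. A possible workaround, sketched here as my own variation rather than anything from the linked post, is to stream the batches into a disk-backed float32 memmap; the function name, file name and dtype below are my choices, not part of the original code:

import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

def cosine_similarity_to_memmap(m, path, batch_size=1000, dtype=np.float32):
    # Chunked self-similarity written to a disk-backed array instead of RAM.
    n = m.shape[0]
    out = np.memmap(path, dtype=dtype, mode='w+', shape=(n, n))
    for start in range(0, n, batch_size):
        end = min(start + batch_size, n)
        out[start:end] = cosine_similarity(m[start:end], m)
    out.flush()
    return out

# Small demo; for the question's 63671-row matrix the file would be roughly 16 GB on disk.
X = sparse.random(5000, 30, density=0.07, format='csr')
sims = cosine_similarity_to_memmap(X, 'cosine_sims.dat', batch_size=500)
print(sims.shape, sims.dtype)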
Upvotes: 1