shyam

Reputation: 1

Need help calculating cosine similarity of a sparse matrix

I'm trying to calculate the cosine similarity between the rows of a sparse matrix:

<63671x30 sparse matrix of type '<class 'numpy.uint8'>'
    with 131941 stored elements in Compressed Sparse Row format>

I used scikit-learn's cosine_similarity function, but it fails with this error: MemoryError: Unable to allocate 29.7 GiB for an array with shape (3984375099,) and data type float64

I googled the error and found a suggestion to increase the size of the paging file, but after doing that my PC just freezes and I have to force a shutdown and reboot. Is there any way to overcome this?

Upvotes: 0

Views: 1154

Answers (1)

Kartikey Singh

Reputation: 892

Inspiration from: Link

Try computing the cosine similarity chunk-wise, i.e. take n rows at a time and calculate their cosine similarity with the whole matrix.

from scipy import sparse
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def cosine_similarity_n_space(m1, m2, batch_size=100):
    """Cosine similarity of every row of m1 with every row of m2, computed in row batches."""
    assert m1.shape[1] == m2.shape[1]
    assert isinstance(batch_size, int) and batch_size > 0

    # Preallocate the dense result; every row is filled in by the loop below.
    ret = np.empty((m1.shape[0], m2.shape[0]))

    # Compare one batch of rows against all of m2 at a time.
    for start in range(0, m1.shape[0], batch_size):
        end = min(start + batch_size, m1.shape[0])
        rows = m1[start:end]
        sim = cosine_similarity(rows, m2)
        ret[start:end] = sim

    return ret


A = np.array([[0, 1, 0, 0, 1], [0, 0, 1, 1, 1], [1, 1, 0, 1, 0]])
A_sparse = sparse.csr_matrix(A)

similarities = cosine_similarity(A_sparse)
chunk_wise_similarity = cosine_similarity_n_space(A_sparse, A_sparse)

# Compare with a tolerance rather than exact float equality.
equal_arrays = np.allclose(similarities, chunk_wise_similarity)

print(equal_arrays)
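
One caveat with the function above: ret is still a dense (m1.shape[0], m2.shape[0]) float64 array, so for the 63671-row matrix in the question it would need about 63671 * 63671 * 8 bytes, roughly 30 GiB, and batching alone may not get around the MemoryError. If only the most similar rows are needed (that is an assumption, the question does not say), a minimal sketch like the one below keeps just the top k matches per row. The names top_k_cosine_similarity and k and the default values are hypothetical, picked for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity


def top_k_cosine_similarity(m, k=10, batch_size=1000):
    """For each row of m, return the indices and scores of its k most similar rows.

    Hypothetical helper: only one (batch_size, n_rows) block is dense at a time,
    and the result is two (n_rows, k) arrays instead of an (n_rows, n_rows) matrix.
    Assumes k >= 1. Each row normally shows up as its own best match.
    """
    n_rows = m.shape[0]
    k = min(k, n_rows)
    top_idx = np.empty((n_rows, k), dtype=np.int64)
    top_sim = np.empty((n_rows, k), dtype=np.float64)

    for start in range(0, n_rows, batch_size):
        end = min(start + batch_size, n_rows)
        sim = cosine_similarity(m[start:end], m)  # dense (end - start, n_rows) block

        # argpartition leaves the k largest scores of each row in the last k columns (unordered).
        part = np.argpartition(sim, n_rows - k, axis=1)[:, n_rows - k:]
        part_sim = np.take_along_axis(sim, part, axis=1)

        # Sort those k candidates by descending similarity.
        order = np.argsort(-part_sim, axis=1)
        top_idx[start:end] = np.take_along_axis(part, order, axis=1)
        top_sim[start:end] = np.take_along_axis(part_sim, order, axis=1)

    return top_idx, top_sim


A = np.array([[0, 1, 0, 0, 1], [0, 0, 1, 1, 1], [1, 1, 0, 1, 0]])
A_sparse = sparse.csr_matrix(A)

idx, sim = top_k_cosine_similarity(A_sparse, k=2, batch_size=2)
print(idx)
print(sim)
```

With k=10 the two output arrays for 63671 rows take about 10 MB in total. The per-batch block is the main cost, so lower batch_size if memory is still tight.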

Upvotes: 1
