innuendo

Reputation: 373

Scipy sparse matrices are not memory efficient in cosine similarity

I am trying to implement cosine similarity using scipy sparse matrices, because I get a memory error with normal (non-sparse) matrices. However, I noticed that when the input matrix (observations) is large, the cosine similarity matrix takes almost the same amount of memory (in bytes) whether it is returned in sparse or non-sparse form. Am I doing something wrong, or is there a way around this? Here's the code, where the input is 5% 1's and 95% 0's.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from scipy import sparse
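# Boolean input: about 5% of the entries are True (1), the rest False (0)
A = np.random.rand(10000, 1000) < .05
A_sparse = sparse.csr_matrix(A)

# Dense output (numpy array)
similarities = cosine_similarity(A_sparse)

# Sparse output (scipy sparse matrix)
similarities_sparse = cosine_similarity(A_sparse, dense_output=False)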

print("1's percentage", np.count_nonzero(A)/np.size(A))
print('memory percentage', similarities_sparse.data.nbytes/similarities.data.nbytes)

Output of one run is:

1's percentage 0.0499615
memory percentage 0.91799018

Upvotes: 1

Views: 348

Answers (1)

lfriedl

Reputation: 61

Elaborating @hpaulj's comments into an answer:

Both of your calls to cosine_similarity return the same values; only the container differs (a dense array versus a sparse matrix). That cosine similarity matrix isn't mostly zeros, so using a sparse format doesn't save space.
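To check this on a smaller input, here is a minimal sketch. The 1000 x 1000 size, the seed, and the variable names are my own choices for illustration, not from the question. It compares the two outputs and counts the full storage of the sparse result (values plus index arrays), assuming the sparse result comes back in CSR or CSC format:

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Smaller stand-in for the question's input (assumed size, fixed seed)
rng = np.random.default_rng(0)
A_small = sparse.csr_matrix((rng.random((1000, 1000)) < 0.05).astype(np.float64))

dense_sim = cosine_similarity(A_small)                       # numpy array
sparse_sim = cosine_similarity(A_small, dense_output=False)  # scipy sparse matrix

# Same numbers, different container
same_values = np.allclose(dense_sim, sparse_sim.toarray())

# Fraction of entries the sparse result actually stores
density = sparse_sim.nnz / (sparse_sim.shape[0] * sparse_sim.shape[1])

# A CSR/CSC matrix stores the values plus two index arrays
sparse_bytes = (sparse_sim.data.nbytes
                + sparse_sim.indices.nbytes
                + sparse_sim.indptr.nbytes)
dense_bytes = dense_sim.nbytes

print("same values:", same_values)
print("stored fraction:", density)
print("sparse bytes / dense bytes:", sparse_bytes / dense_bytes)
```

Note that the ratio in the question compares only the .data arrays. Once the index arrays are counted too, a sparse matrix that is this full should end up larger than the dense array.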

Input data that's mostly zeros doesn't necessarily (or even typically) yield a cosine similarity matrix that's mostly zeros. Cosine(i, j) = 0 occurs(*) for a pair of rows (i, j) of the matrix only if they have no nonzero values in any of the same columns.

(* Or if the dot product otherwise comes out to 0, but that's a side point here.)
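For the random input in the question you can estimate how rarely that happens. Assuming each entry is independently 1 with probability p, two rows both have a 1 in a given column with probability p^2, so they share no column at all with probability (1 - p^2)^n_cols. A rough sketch of that estimate, with an empirical check on a smaller matrix (sizes, seed, and names are assumptions for illustration):

```python
import numpy as np
from scipy import sparse

p, n_cols, n_rows = 0.05, 1000, 1000

# Expected fraction of row pairs that share at least one column
expected_nonzero = 1 - (1 - p**2) ** n_cols

# Empirical check: (A @ A.T)[i, j] counts the columns shared by rows i and j
rng = np.random.default_rng(0)
A = sparse.csr_matrix((rng.random((n_rows, n_cols)) < p).astype(np.int32))
overlap = A @ A.T

# Exclude the diagonal (each non-empty row trivially overlaps itself)
diag_nnz = np.count_nonzero(overlap.diagonal())
empirical_nonzero = (overlap.nnz - diag_nnz) / (n_rows * (n_rows - 1))

print("expected nonzero fraction: ", expected_nonzero)
print("empirical nonzero fraction:", empirical_nonzero)
```

With p = 0.05 and 1000 columns the expected fraction comes out to about 0.918, which is close to the 0.918 ratio printed in the question.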

Upvotes: 0
