Nared Fuengverojsakul

Reputation: 19

I want to improve the efficiency of cosine similarity calculation to make it faster

I have a NumPy array of shape (96341, 1000), and I want to compute the pairwise cosine similarity of its rows. The machine I'm working on has 8 vCPUs and 32 GB of RAM. Below is my initial code. I want this function to run faster, I want to be able to control/limit the amount of memory it uses, and I want to keep the same result.

vec.shape
(96341,1000)

import numpy as np

def csm(A):
    num = np.dot(A, A.T)
    p1 = np.sqrt(np.sum(A**2, axis=1))[:, np.newaxis]
    p2 = np.sqrt(np.sum(A**2, axis=1))[np.newaxis, :]

    result = num / (p1 * p2)

    return result

cos = csm(vec)

Upvotes: 0

Views: 616

Answers (1)

chrslg

Reputation: 13316

As @mpw2 said, your numerator is a 96341×96341 array. But so is your denominator, as written. Which means that at some point you will have three ~74 GB arrays in memory: the one holding the numerator num, the one temporarily holding the denominator p1 * p2, and the one being built to hold the result.
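For reference, here is a quick back-of-the-envelope check of that figure (simple arithmetic, assuming the default float64 dtype, 8 bytes per element):

n = 96341
bytes_per_array = n * n * 8      # float64 -> 8 bytes per element
print(bytes_per_array / 1e9)     # ~74.25 GB for one 96341x96341 array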

For the denominator, that is so easily solvable that I can't see any reason not to do it:

def csm(A):
    num = np.dot(A, A.T)
    p1 = np.sqrt(np.sum(A**2, axis=1))[:, np.newaxis]
    p2 = np.sqrt(np.sum(A**2, axis=1))[np.newaxis, :]

    # Only change: let broadcasting do the job. Broadcasting does not
    # allocate memory: p1 and p2 remain (96341,1) and (1,96341) arrays
    # that only behave as "implicit" 96341x96341 arrays.
    # Note that the parentheses are redundant here, since / is evaluated
    # left to right anyway; I just want the order to be explicit.
    result = (num / p1) / p2

    return result

Note also that p1 and p2 are the same array, just reshaped through different views, so there is no need to compute it twice:

def csm(A):
    num = np.dot(A, A.T)
    px = np.sqrt(np.sum(A**2, axis=1))   # computed once, viewed twice
    result = num / px[:, None] / px[None, :]
    return result
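A side note: since each entry is num[i, j] / (px[i] * px[j]), an equivalent formulation normalizes the rows of A to unit length first and then does a single matrix product; the result is the same up to floating-point rounding. A minimal sketch:

def csm_normalized(A):
    # Divide each row by its L2 norm; one matmul then gives the cosines directly
    An = A / np.linalg.norm(A, axis=1, keepdims=True)
    return An @ An.T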

Finally, if we want to avoid those 74 GB arrays (the code above still peaks at 2×74 GB: num and result are both in memory at one moment), we can compute by blocks:

def csm2(A,B):
    num = np.dot(A, B.T)  
    # this time they are not the same
    pa = np.sqrt(np.sum(A**2, axis=1))[:,None]
    pb = np.sqrt(np.sum(B**2, axis=1))[None, :]
    return num / pa / pb

Then you can compute by blocks:

res = np.empty((len(vec), len(vec)))
N = 1000  # Block size. It doesn't need to divide len(vec); the blocks at
          # the end of the rows or columns are simply smaller.
for i in range(0, len(vec), N):
    for j in range(i, len(vec), N):  # j starts at i: exploit the symmetry
        res[i:i+N, j:j+N] = csm2(vec[i:i+N], vec[j:j+N])
        res[j:j+N, i:i+N] = res[i:i+N, j:j+N].T

Not only do we avoid most of the redundant computation, thanks to the symmetry already mentioned (the inner loop starts at j = i), but we also never have arrays bigger than N×N in memory, except for res itself (and that is unavoidable if the full matrix is the result you want).
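If you want to convince yourself that the blocked computation matches the direct one (up to floating-point rounding), a small self-contained check along these lines works; the shape below is an arbitrary placeholder, deliberately not a multiple of the block size, and it reuses the csm and csm2 functions defined above:

import numpy as np

rng = np.random.default_rng(0)
small = rng.standard_normal((523, 40))   # small stand-in for vec

N = 100
res = np.empty((len(small), len(small)))
for i in range(0, len(small), N):
    for j in range(i, len(small), N):
        res[i:i+N, j:j+N] = csm2(small[i:i+N], small[j:j+N])
        res[j:j+N, i:i+N] = res[i:i+N, j:j+N].T

assert np.allclose(res, csm(small))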

And, sure, we use for loops, and almost all my answers on [so] mention somewhere that for loops are the ultimate crime in numpy. But these ones are OK: if N is big enough, the time spent on the loop iterations themselves is negligible compared to the time spent inside the numpy functions.
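To make "negligible" concrete (simple arithmetic, assuming N = 1000 and the symmetric loop above): the Python-level loop body runs only a few thousand times, while each iteration performs a block matrix product of roughly a billion multiply-adds:

import math

n_rows = 96341
N = 1000
n_blocks = math.ceil(n_rows / N)           # 97 blocks per axis
n_iters = n_blocks * (n_blocks + 1) // 2   # 4753 Python-level iterations
print(n_blocks, n_iters)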

The last remaining problem is, as I said, the 74 GB of the result itself (at least there is now only one such array, not 3 as in your version, and not 2 as in my first version). That you can avoid at the application level, depending on what you do with the result: try to use only part of the result at a time.

For example, if the idea is to find the minimum value, you can compute blocks of the matrix, take the minimum of each block, and then take the min of those minimums. That is, in a way, a classical map/reduce approach.

So, instead of

M=csm(vec)
print(M.min())

do

mn = np.inf
for i in range(0, len(vec), N):
    for j in range(i, len(vec), N):
        mn = min(mn, csm2(vec[i:i+N], vec[j:j+N]).min())
print(mn)

Of course, I doubt that what you want to do with csm is as simple as a min. But since you haven't said, I can only speculate that there are ways to do it while computing csm on one block at a time.
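For instance (pure speculation about your use case), if the real goal were "for each row, which other row is most similar?", the same blockwise pattern applies. This sketch keeps only a running best value and best index per row, so it never materializes more than one N×N block at a time:

best_val = np.full(len(vec), -np.inf)
best_idx = np.zeros(len(vec), dtype=int)
for i in range(0, len(vec), N):
    for j in range(0, len(vec), N):
        block = csm2(vec[i:i+N], vec[j:j+N])
        if i == j:
            np.fill_diagonal(block, -np.inf)   # ignore self-similarity
        cand = block.argmax(axis=1)            # best column within this block
        vals = block[np.arange(block.shape[0]), cand]
        better = vals > best_val[i:i+N]
        best_val[i:i+N][better] = vals[better]
        best_idx[i:i+N][better] = cand[better] + j   # back to global indices

best_idx[k] is then the index of the row most similar to row k, and best_val[k] their cosine similarity.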

Upvotes: 0
