Reputation: 19
I have a numpy array of size (96341, 1000). I want to compute the cosine similarity matrix of this array. The machine I'm working on has 8 vCPUs and 32 GB of RAM. This is my initial code. I want this function to run faster, to be able to control/limit the amount of memory used, and to keep the same result.
vec.shape
(96341,1000)
import numpy as np

def csm(A):
    num = np.dot(A, A.T)
    p1 = np.sqrt(np.sum(A**2, axis=1))[:, np.newaxis]
    p2 = np.sqrt(np.sum(A**2, axis=1))[np.newaxis, :]
    result = num / (p1 * p2)
    return result
cos = csm(vec)
Upvotes: 0
Views: 616
Reputation: 13316
As @mpw2 said, your numerator is a 96341×96341 array. But so is your denominator, as written. Which means that at some point you'll have three ~74 GB arrays in memory: the one holding the numerator num, the one temporarily holding the denominator p1*p2, and the one being built to hold the result.
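For reference, that 74 GB figure is just 96341² float64 values at 8 bytes each; a one-liner to check it:

n = 96341
print(n * n * 8 / 1e9)  # ~74.25 GB for one full 96341x96341 float64 array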
For the denominator, that is so easily solvable that I can't see any reason not to do it:
def csm(A):
    num = np.dot(A, A.T)
    p1 = np.sqrt(np.sum(A**2, axis=1))[:, np.newaxis]
    p2 = np.sqrt(np.sum(A**2, axis=1))[np.newaxis, :]
    # Only change: let broadcasting (which does not allocate
    # memory: p1 and p2 remain "implicit" 96341x96341 arrays)
    # do the job.
    # Note that the parentheses are superfluous here, since / is
    # evaluated left to right anyway; I just want to be explicit.
    result = (num / p1) / p2
    return result
Note also that p1 and p2 are the same array, just viewed with different shapes, so there is no need to compute it twice:
def csm(A):
    num = np.dot(A, A.T)
    px = np.sqrt(np.sum(A**2, axis=1))
    result = num / px[:, None] / px[None, :]
    return result
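As a side note (my addition, assuming you are free to overwrite num; the name csm_inplace is mine): the division can also be done in place, so the result reuses num's storage and no second 96341×96341 array is ever allocated:

def csm_inplace(A):
    num = np.dot(A, A.T)
    px = np.sqrt(np.sum(A**2, axis=1))
    num /= px[:, None]   # in-place broadcasted division: no big temporary
    num /= px[None, :]
    return num           # num itself is the result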
Finally, if we want to avoid those 74 GB arrays altogether (the non-in-place version above still peaks at 2×74 GB, with num and the result both in memory at one moment; even the in-place variant keeps one), we can compute by blocks:
def csm2(A, B):
    num = np.dot(A, B.T)
    # This time they are not the same
    pa = np.sqrt(np.sum(A**2, axis=1))[:, None]
    pb = np.sqrt(np.sum(B**2, axis=1))[None, :]
    return num / pa / pb
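A quick sanity check on a small random array (a sketch of how to convince yourself the block function matches the original):

rng = np.random.default_rng(0)
small = rng.random((50, 10))
assert np.allclose(csm(small), csm2(small, small))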
Then you can compute by blocks:
res = np.empty((len(vec), len(vec)))
N = 1000  # Block size. It doesn't need to divide len(vec): the blocks at
          # the end of the rows/columns are simply smaller, that's all.
for i in range(0, len(vec), N):
    for j in range(i, len(vec), N):  # j starts at i: the matrix is symmetric
        res[i:i+N, j:j+N] = csm2(vec[i:i+N], vec[j:j+N])
        res[j:j+N, i:i+N] = res[i:i+N, j:j+N].T
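Worth noting: res itself is 74 GB, so on your 32 GB machine it will not fit in RAM regardless. One possible workaround (my suggestion, not part of the original answer; it needs 74 GB of disk instead, and the filename is arbitrary) is to back res with a file via np.memmap and let the same block loop fill it:

n = len(vec)
res = np.memmap("cosine.dat", dtype=np.float64, mode="w+", shape=(n, n))
for i in range(0, n, N):
    for j in range(i, n, N):
        block = csm2(vec[i:i+N], vec[j:j+N])
        res[i:i+N, j:j+N] = block
        res[j:j+N, i:i+N] = block.T
res.flush()  # make sure everything is written to disk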
Not only does this block scheme avoid most of the redundant computation, thanks to the symmetry already mentioned, but we also never hold an array bigger than N×N in memory, except for res itself (and that is unavoidable if that is the result you want).
And, sure, we use for loops, and almost all my answers on [so] mention somewhere that for loops are the ultimate crime in numpy. But these ones are OK: if N is big enough, the time spent on the loop iterations is negligible compared to the time spent inside the numpy functions.
The last remaining problem is, as I said, the 74 GB of the result (at least there is now only one such array, not 3 as in your solution, nor 2 as in my first version). That one you can avoid at the application level, depending on what you do with the result: try to use only part of it at a time.
For example, if the idea is to find the minimum value, you can compute blocks of that matrix, take the minimum of each block, and then take the min of those mins. That is somehow a classical map/reduce solution.
So, instead of
M = csm(vec)
print(M.min())
do
mn = np.inf
for i in range(0, len(vec), N):
    for j in range(i, len(vec), N):
        mn = min(mn, csm2(vec[i:i+N], vec[j:j+N]).min())
print(mn)
Of course, I doubt that what you want to do with that csm is as simple as a min. But since you haven't said, I can only speculate that there are similar ways to do it while computing csm one block at a time.
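For instance (pure speculation about your use case; the names best_sim and best_idx are mine), if the goal were to find each row's most similar other row, that too reduces block by block without ever materializing the full matrix:

n = len(vec)
best_sim = np.full(n, -np.inf)           # best similarity found so far, per row
best_idx = np.zeros(n, dtype=np.int64)   # index of the most similar row
for i in range(0, n, N):
    for j in range(0, n, N):
        block = csm2(vec[i:i+N], vec[j:j+N])
        if i == j:
            np.fill_diagonal(block, -np.inf)  # ignore self-similarity
        m = len(block)                        # may be < N on the last block
        k = block.argmax(axis=1)              # best column inside this block
        s = block[np.arange(m), k]
        upd = s > best_sim[i:i+m]
        best_sim[i:i+m] = np.where(upd, s, best_sim[i:i+m])
        best_idx[i:i+m] = np.where(upd, j + k, best_idx[i:i+m])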
Upvotes: 0